GENI Distributed Services Preliminary Requirements and Design


1 GENI Distributed Services Preliminary Requirements and Design
Tom Anderson and Amin Vahdat (co-chairs) David Andersen, Mic Bowman, Frans Kaashoek, Arvind Krishnamurthy, Yoshi Kohno, Rick McGeer, Vivek Pai, Mark Segal, Mike Reiter, Mothy Roscoe, Ion Stoica

2 Distributed Services Work Status
Work split into subgroups:
Security Architecture (Mike Reiter, Tom Anderson, Yoshi Kohno, Arvind Krishnamurthy)
Edge cluster definition (Mic Bowman, Tom Anderson)
Storage (Frans Kaashoek, Dave Andersen, Mic Bowman)
Resource Allocation (Amin Vahdat, Rick McGeer)
Experiment Support (Amin Vahdat, Arvind Krishnamurthy)
Operations Support (Mark Segal, Vivek Pai)
Communications Substrate (Arvind Krishnamurthy, Amin Vahdat, Tom Anderson)
Legacy Systems (Tom Anderson)

3 Distributed Services Work Status
Each section progressed against a defined sequence:
Overview
Requirements description
Preliminary design
Related work discussion
Modules and dependencies identified
WBS estimate
Every part of the design is subject to change as science goals are refined and additional information is gathered, including during construction.

4 Distributed Services Work Status
Overall state:
Rationale/design needs better documentation and an independent review
Modules identified; initial WBS estimates completed
Need clarity from the GSC as to prioritization
Specifics:
Security: design solid; user scenarios needed
Edge Node/Cluster: requirements in flux depending on budget issues; moved to GMC
Storage: requirements solid; modules identified
Resource allocation: design solid; user scenarios needed
Experimenter support: user experience needed to drive requirements
Operations support: requirements outlined
Communication Services: requirements outlined
Legacy Support: requirements outlined

5 Facility Software Architecture
Layered architecture (figure):
Distributed Services: achieve system-wide properties such as security, reproducibility, ...
GMC: name space for users, slices, & components; set of interfaces ("plug in" new components); support for federation ("plug in" new partners)
Substrate Components: provide the ability to virtualize and isolate components in a way meaningful to experiments

6 Facility Software Architecture
At the hardware device level: component manager, virtualization and isolation layer
Minimal layer (GENI management core) to provide basic building blocks
Robustness of this layer is critical to the entire project, so keep it small, simple, well-defined; avoid a "big bang" integration effort
Services layered on top of GMC to provide system-level requirements
Modular to allow independent development and evolution; as technology progresses, post-GENI efforts can replace these services piece by piece

7 User Centric View
Researchers: ease of describing, launching, and managing experiments; network-level, not node-level
Operations staff: administrative cost of managing the facility
Resource owners (hardware contributors): policy knobs to express priorities and security policy for the facility
System developers (software contributors): GENI developers and the broader research community building tools that enhance GENI
End users: researchers and the public
Goal of the distributed services group is to make the system more useful, not more powerful

8 Principal Concerns
Security and isolation
Operational cost and manageability
Usability and experiment flexibility
Scalability, robustness, performance
Experiment development cost
Construction cost and schedule
Policy neutrality: avoid binding policy decisions into the GENI architecture

9 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

10 Security Architecture
What is the threat model? What are the goals/requirements? Access control Authentication and key management Auditing Operator/administrative interfaces

11 Threat model
Exploitation of a slice:
Runaway experiments (unwanted Internet traffic, exhausting disk space)
Misuse of an experimental service by end users, e.g., to traffic in illegal content
Corruption of a slice, via theft of the experimenter's credentials or compromise of slice software
Exploitation of GENI itself:
Compromise of the host O/S
DoS or compromise of the GENI management infrastructure

12 Requirements: Do no harm
Explicit delegations of authority: node owner → GMC → researcher → students → ...
Least privilege: goes a long way toward confining rogue activities
Revocation: keys and systems will be compromised
Auditability
Scalability/Performance
Autonomy/Federation/Policy Neutrality: control ultimately rests with node owners, who can delegate selected rights to GMC

13 Modeling Access Control in Logic
Expressing beliefs:
"Bob says F": it can be inferred that Bob believes that F is true
"Bob signed F": Bob states (cryptographically) that he believes that F is true
Types of beliefs:
"Bob says open(resource, nonce)": Bob wishes to access a resource
"Bob says (Alice speaksfor Bob)": Bob wishes to delegate all authority to Alice
"Bob says delegate(Bob, Alice, resource)": Bob wishes to delegate authority over a specific resource to Alice
Inference rules (examples):
speaksfor-e: from "A says (B speaksfor A)" and "B says F", infer "A says F"
says-i: from "A signed F", infer "A says F"
Proofs: a sequence of inference rules applied to beliefs (a small proof-checking sketch follows)
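
The two rules above can be checked mechanically. Below is a minimal, hypothetical sketch of such a checker, not the GENI implementation: signature verification is assumed to have already happened, and names such as Says and speaksfor_e are illustrative only.

    # Minimal sketch of the says/speaksfor logic (hypothetical).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Says:               # represents "A says F"
        principal: str
        formula: tuple

    def signed(principal, formula):
        # says-i: from "A signed F" infer "A says F"
        return Says(principal, formula)

    def speaksfor_e(delegation: Says, statement: Says):
        # speaksfor-e: from "A says (B speaksfor A)" and "B says F", infer "A says F"
        kind, b, a = delegation.formula
        assert kind == "speaksfor" and a == delegation.principal and b == statement.principal
        return Says(a, statement.formula)

    # Example: Bob delegates to Alice; Alice requests open(D208, n1).
    bob_delegates = signed("Bob", ("speaksfor", "Alice", "Bob"))
    alice_opens = signed("Alice", ("open", "D208", "n1"))
    print(speaksfor_e(bob_delegates, alice_opens))   # Says(principal='Bob', formula=('open', 'D208', 'n1'))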

14 Traditional Access Control Lists
Figure: derivation of "Mike says open(D208)" with a traditional ACL. The request carries only "Scott signed open(D208)"; the facts "Scott speaksfor Mike.Students" and "delegate(Mike, Mike.Students, D208)" are stored, unsigned, in the reference monitor and are therefore part of the TCB.

15 A “Proof Carrying” Approach
Figure: the same derivation with proof-carrying credentials. "Scott signed open(D208)", "Mike signed (Scott speaksfor Mike.Students)", and "Mike signed delegate(Mike, Mike.Students, D208)" are all received in the request; the reference monitor, part of the TCB, checks the resulting proof of "Mike says open(D208)".

16 Authorization Example (simplified)
Figure (simplified authorization example): GENI Management Central, local administrators at University 1 and University 2, a professor, a student, and machine X with its resource monitor. Step 1 delegates all authority; steps 2 through 5 pass "you can authorize X to send to GENI nodes" down the chain; the resource monitor on a receiving GENI node finally checks the request "X says send?" against these delegations.

17 Authentication and key management
GENI will have a PKI
Every principal (e.g., users, administrators, nodes) has a public/private key pair, certified by a local administrator
Keys sign certificates to make statements in the formal logic (identity, groups, authorization, delegation, ...)
Private key compromise is an issue:
Encrypted with the user's password? Off-line attacks
Smart card/dongle? Most secure, but less usable
Capture-resilient protocols: a middle ground

18 Capture-Resilience Properties
Figure: capture-resilience outcomes under different compromises of the device and its server. Depending on what the attacker captures, the attacker gains no advantage, must succeed in an online dictionary attack, must succeed in an offline dictionary attack, or can forge only until the server is disabled for the device.

19 Delegation in Capture-Protection
Figure: authorize and revoke operations when delegating use of a capture-protected key.

20 Intrusion Detection
Traditional intrusion detection methods may not suffice for monitoring experiments:
Misuse detection: specify bad behavior and watch for it. Problem: experiments do lots of things that look "bad".
(Learning-based) anomaly detection: learn "normal" behavior and watch for exceptions. Problem: experiments may be too short-lived or ill-behaved to establish a "normal" baseline.

21 Intrusion Detection
Specification-based intrusion detection is more appropriate for monitoring experiments: specify good behavior and watch for violations.
It also fits in naturally with the authorization framework.

22 Audit Log Example: PlanetFlow
PlanetFlow logs the headers of packets sent and received between each node and the Internet
Enables operations staff to trace complaints back to the originating slice: notify the experimenter; in an emergency, suspend the slice (an example audit query follows)
All access control decisions can be logged and analyzed post hoc:
To understand why a request was granted (e.g., what gave an attacker permission to create a sliver)
To detect brute force attacks
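
To illustrate the kind of post-hoc audit this enables, the sketch below builds a small PlanetFlow-style header table in SQLite and traces a complaint (traffic seen at a given destination around a given time) back to the slice that sent it. The schema and column names are hypothetical, not PlanetFlow's actual schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE flows (
        slice TEXT, src_ip TEXT, dst_ip TEXT, dst_port INTEGER,
        proto TEXT, start_ts INTEGER, end_ts INTEGER, packets INTEGER)""")
    db.execute("INSERT INTO flows VALUES "
               "('geni_expt42', '10.0.0.5', '198.51.100.7', 80, 'tcp', 1000, 1200, 5000)")

    # Complaint: "unwanted traffic to 198.51.100.7 around t=1100".
    # Find the originating slice so operations staff can notify the experimenter.
    rows = db.execute("""SELECT slice, src_ip, SUM(packets)
                         FROM flows
                         WHERE dst_ip = ? AND start_ts <= ? AND end_ts >= ?
                         GROUP BY slice, src_ip""",
                      ("198.51.100.7", 1100, 1100)).fetchall()
    print(rows)   # [('geni_expt42', '10.0.0.5', 5000)]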

23 Packet Logging Architecture
Figure: packet logging pipeline. Netfilter in the kernel diverts packets to a logging daemon, which groups packet headers into sessions and batches them into a database (MySQL, etc.); operations staff query the database through a web front end whose CGI generates SQL queries.

24 Performance
Straightforward approach: 2.5% of CPU; < 1% of bandwidth
Modifications: group sessions in the kernel; lazily add to the database; eliminate intra-GENI traffic; limit senders if auditing is too expensive
10 Gbps? Large flows are easy; are small flows even realistic?

25 Security Deliverables (21E)
Definition of certificate format and semantics (2E)
Certificate mgmt svc (construction, storage, lookup and caching) (5E)
Access control guard (resource monitor) (2E)
Security policy language and certificate revocation, and UI (3E)
Secure and reliable time service (purchase)
Proof generator (2E)
Specification-based intrusion detection service (5E)
Capture protection server and client software (2E)
#E represents an estimate in developer-years, assuming a five year construction span, excluding management, system test, overhead, and risk factor.
* WBS Number:
* WBS Task Name: Security service
* Description:
* Deliverables:
** 1) Definition of certificate format and semantics (2E)
** 2) Certificate mgmt svc (construction, storage, lookup and caching) (5E)
** 3) Access control guard (resource monitor) (2E)
The security resource monitor is a software component that confirms that a request to access a resource is consistent with access-control policy. It does so by verifying a logical proof of this assertion that derives from digitally signed credentials. It does not construct this proof itself, but rather validates a proof provided with the request. Note that this component is central to security in the facility.
** 4) Security policy language and certificate revocation, and UI (3E)
This tool enables a user to configure security policy. Configuring policy results in the creation of credentials, digitally signed by the user's private key. This tool can be used by GENI administrators and researchers alike.
** 5) Secure and reliable time service
We should be able to purchase/license this from someone.
** 6) Proof generator (2E)
This component generates a proof demonstrating that a request is compliant with the access-control policy dictated by the security resource monitor ruling on the request.
** 7) Specification-based intrusion detection service (5E)
** 8) Capture protection server and client software (2E)
The client and server software implement capture-protected cryptographic keys, which resist misuse even if captured.
* Dependencies: Storage system (for storing certificates)... expects transactional semantics; Communication system
* People-years: 21E --> 21 Engineers, 8 QA, 4 Managers, 4 Architects
* Non-labor costs: No non-labor costs. Development systems in overhead. Testbed could be included.
* Downscoping: 30%
* Novelty: 30%
* Confidence: 2x
* Rationale:
** Deliverable #3: This component implements two types of functions: (i) it provides to clients a formal description of the access-control policy that governs the request a client plans to make, together with a nonce identifier; and (ii) it verifies the proof submitted with the client's request to make sure that it is a valid proof of the required access-control policy, that it derives from credentials bearing valid digital signatures, and that the proof goal contains a recently-issued nonce identifier that has not already been received in a previous proof goal (and hence that the proof is not replayed). For a person familiar with the state of the art in logics and proof checking, the estimate should be conservative. That said, since this component is central to GENI security, a greater degree of assurance should be applied to its analysis and testing.
** Deliverable #4: The most challenging aspect of this tool is the user interface. At a high level, it must enable a user to manage a namespace that can include names of individuals (e.g., "Tom") and groups (e.g., "students"). Managing this namespace involves conveying authority from one identifier to another, where an identifier can be either a name in the namespace or a public key. So, for example, this tool enables a person to specify that a public key carries the authority of (or "speaks for") a name (e.g., "Tom"). It also enables one name (e.g., "Tom") to carry the authority of another (e.g., a group called "students"). The resulting credentials created by the tool are expressed in formal logic, for use in proving (WBS YYY) and checking proofs (WBS ) of access-policy compliance.
** Deliverable #6: This component takes as input a policy goal to prove, which itself includes a nonce identifier. This goal is issued, e.g., by the security resource monitor guarding the resource to which access is to be requested. This component generates a proof that the access complies with the goal policy, if in fact it does. This proof derives from digitally signed credentials that can be provided as input to this component, or that this component is in charge of retrieving from elsewhere. The former option, in which all requisite credentials are provided as input, is the "downscoped" version of this component and is somewhat simpler than the full version implementing credential (or lemma) retrieval from repositories. However, it should be noted that in its downscoped version, this component will need to make use of additional external functionality to ensure that credentials are available when needed to create proofs. This may take the form of application-specific logic that "pushes" credentials where they are likely to be needed, for example.
** Deliverable #8: The client and server algorithms are specified in existing documents and are well understood. Implementations have also been demonstrated. Timing attacks on the server algorithm should be able to be remedied with known techniques.

26 Security: Open Issues
DoS-resistant GENI control plane?
The initial control plane will employ IP and inherit its DoS vulnerabilities; GENI experimentation may demonstrate a more resistant control plane
Design proof-carrying certificates to operate independently of the communication channel
Privacy of operational data in GENI?
Operational procedures and practices: central to the security of the facility

27 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

28 Example Substrate
Figure: an example substrate spanning the Internet and the GENI backbone, connecting Site A and Site B, a suburban hybrid access network, a sensor net, and an urban grid access network.
Acronyms: PAP: Programmable Access Point; PEN: Programmable Edge Node; PEC: Programmable Edge Cluster; PCN: Programmable Core Node; GGW: GENI Gateway.

29 Programmable Edge Cluster: HW
Capabilities should be driven by the science plan
Draft: 200 sites, with a cluster of PCs at each
Workhorse nodes: running experiments, emulating higher speed routers, distributed services
Multicore CPU, 8 GB of DRAM, 1 TB disk, GigE
High speed switch and router connecting to the rest of GENI
Cut in the latest iteration of the draft plan: 20 sites with a cluster of 200 PCs each, for compute/storage intensive applications

30 Programmable Edge Cluster: SW
Figure: software stack on an edge cluster node. A low-level VMM (e.g., Xen) exposes a stock VMM interface (e.g., PVI) to VM slivers and to a VServer kernel; the VServer kernel exposes the stock Linux vserver interface to vserver slivers and a GENI control vserver; a GENI control VM sits alongside the VM slivers.
Experiments run as a vserver sliver or as a VM sliver and communicate with the GENI management code (itself running as a sliver) through RPC over TCP/IP (a sketch of such a call follows).
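
To make the "RPC over TCP/IP" path concrete, here is a hedged sketch of a sliver talking to the node's component manager over XML-RPC. The endpoint, the method name create_sliver, and the ticket argument are illustrative assumptions; the real component manager interface is defined in the GMC documents.

    import xmlrpc.client

    # Hypothetical component-manager endpoint on the local node.
    cm = xmlrpc.client.ServerProxy("http://localhost:8080/")

    ticket = {"guid": "ticket-123", "recipient": "princeton_expt7", "valid_for_s": 3600}
    try:
        # Ask the component manager to instantiate a sliver for this ticket.
        sliver = cm.create_sliver(ticket)
        print("sliver handle:", sliver)
    except (ConnectionError, xmlrpc.client.Fault) as err:
        print("component manager request failed:", err)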

31 Execution environments
PlanetLab-like best-effort VServers:
Fixed kernel, convenient API
Weak isolation between slivers; weaker security for critical components
Small number of standard configurations (minimal, maximal, expected)
Virtual machine monitors (e.g., Xen, VMware):
Choice of prepackaged or custom kernels (as in Emulab): Linux + Click; others possible: Singularity, Windows, raw Click
Stronger resource guarantees/isolation
Poor I/O performance; limited number of VMs (scalability)

32 Service Location
Services can be implemented at any of a set of levels:
Inside the VMM, if kernel changes are required (e.g., to implement fast segment read/write to disk)
In its own VM on the VMM, to configure the VMM or if security is needed (e.g., the GENI component manager, the GENI security monitor)
In the Linux vserver kernel, if Linux kernel changes are needed (e.g., traffic monitoring)
In its own vserver sliver, running as a best-effort service (e.g., the vserver component manager)
In a library linked with the experiment (e.g., database, cluster file I/O)

33 Booting
To boot a node:
Trusted computing hardware on each node
Secure boot fetches the initial system software
Initial machine state eventually comprises: the virtual machine monitor (e.g., Xen); an initial domain, the GENI Domain (GD); possibly a VServer kernel by default
To boot a sliver:
Send an authorized request to the GENI Domain
The GD verifies the request and creates a new Xen/vserver domain
Loads software that contains the sliver secure boot (GENI auth code, ssh server, etc.)
See the reference component design document for details

34 Containment & Auditing
Limits placed on slice "reach":
restricted to the slice and GENI components
restricted to GENI sites
allowed to compose with other slices
allowed to interoperate with the legacy Internet
Limits on resources consumed by slices: cycles, bandwidth, disk, memory; rate of particular packet types, unique addresses per second
Mistakes (and abuse) will still happen, so auditing will be essential: network activity → slice → responsible user(s)

35 Edge Cluster WBS Deliverables
See GMC specification

36 Open Questions
Resource allocation primitives on each node:
Reservation model: % of CPU in a given time period? Strict priorities?
What about experiments/services whose load is externally driven (e.g., a virtual ISP)?
Other resources with contention: memory, disk
Fine-grained time-slicing of the disk head with real time guarantees is unlikely to work as intended: either best effort, or a disk head per application (meaning we need at least k+1 disk heads for k disk intensive applications)
How are service-specific resources represented (e.g., a segment store)?
How are resources assigned to services? Through experiments giving them resources explicitly, or via configuration?

37 More Open Questions
Kernel changes needed in Xen and vservers to implement the resource model; a custom GENI OS to run on Xen?
Allocation of IP address/port space to slivers (well known ports)
Efficient vserver sliver creation: configure a new sliver (e.g., to run a minimal script) with minimal I/O overhead and minimal CPU time
Can/should vservers run diskless? Make it easy to share file systems read-only; vserver image provided by symlinks or NFS loopback mounts?

38 More Open Questions What is the agreement with hosting sites?
Rack space
IP address space (a /24 per site?)
Direct connectivity to the Internet; BGP peering? Bandwidth to the Internet?
Local administrative presence?
Ability to add resources under local control?
Absence of filtering/NATs

39 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

40 Storage for GENI
Storage enables future network applications that integrate storage, computation, and communication:
Large-scale sensor networks
Digital libraries that store all human knowledge
Near-on-demand TV
Experiments also need storage: experiment results; logs (e.g., complete packet traces); huge data sets (e.g., all data on the Web); running the experiment (binaries, slice data, etc.)
Managing GENI requires storage: configuration; security and audit logging
Storage will be distributed and shared

41 Storage Goals Enable experiments that integrate computation and storage Provide sufficient storage (e.g., 200 Petabytes) Provide convenient access to the storage resources Provide high performance I/O for experiments Allow the storage services to evolve Allow experimenters to build new storage services Balance expectation of durability Permit effective sharing of storage resources User authentication and access control Resource control

42 Overall Storage Design
Node-level storage building blocks
Higher level distributed storage abstractions
Dependencies on authorization and resource management
Figure: storage at three scales: node, cluster, and wide-area storage services.

43 Node-level Storage Support
Convenient access for experimenters & admins:
File system interface
SQL database on each node (likely); needed by many apps and by GENI managers (e.g., the auditing system)
Extensible access for service creators:
Raw disk / block / extent store: direct access for building services
"Loopback" filesystem support: facilitates creating distributed storage services
Efficient use of disk bandwidth
(A sketch of tying these interfaces to per-sliver accounting follows.)
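
Each of these interfaces has to respect per-sliver resource accounting (the storage WBS later describes this as integration with quota mechanisms). The sketch below shows one simple way that could look: a byte quota enforced in front of a plain per-sliver directory. The class and numbers are illustrative assumptions, not the GENI design.

    import os

    class QuotaStore:
        """Write files under a sliver's directory, refusing writes past its byte quota."""
        def __init__(self, root, sliver, quota_bytes):
            self.dir = os.path.join(root, sliver)
            os.makedirs(self.dir, exist_ok=True)
            self.quota = quota_bytes

        def used(self):
            return sum(os.path.getsize(os.path.join(self.dir, f)) for f in os.listdir(self.dir))

        def write(self, name, data: bytes):
            if self.used() + len(data) > self.quota:
                raise IOError("sliver quota exceeded")
            with open(os.path.join(self.dir, name), "wb") as f:
                f.write(data)

    store = QuotaStore("/tmp/geni-storage", "expt42", quota_bytes=1_000_000)
    store.write("results.log", b"run 1: ok\n")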

44 Distributed Storage Support
Consolidate frequently used data management services
Convenient administration and experimentation:
Transparent wide-area file system
Data push services: install data "X" on 200 nodes
Log storage and collection for research and management
High performance distributed I/O, e.g., an improved Google File System (cluster); ok to compromise on application-level transparency
Possible high-performance wide-area filesystem
Write-once, global high-performance storage
Storage for constrained nodes (e.g., sensors)

45 Storage Deliverables (28E)
Local filesystem interface (1E)
SQL database (1E)
Services for creating new storage services and intercepting storage system calls (1E)
Raw disk interface (3E)
Block-based storage interface (3E)
A wide-area filesystem for administration and experiment management (4E)
A high-performance cluster filesystem (4E)
Fast write-once/read-only storage services (3E)
A reduced complexity storage interface for constrained nodes (3E)
Maintenance and bug fixing throughout the life cycle (5E)
* WBS Number:
* WBS Task Name: Storage Services
* Description: Create the software and hardware infrastructure to allow administrators, the developers of storage-oriented services, and researchers access to storage resources on GENI. This infrastructure must meet the different needs of several groups of users, from access to highly reliable storage for administrative purposes, to extremely fast storage for high-performance experiments, to providing convenient access to remote resources to facilitate experimentation.
* Deliverables:
** 1) Local filesystem interface (1E)
The only difficulty here is that it must support the resource accounting and authorization model. Assuming that the host OS has been modified already, this task should be a simple integration with existing quota mechanisms, depending on the richness of the sharing and the needed granularity of resource accounting.
** 2) SQL database (1E)
The challenges are as above: resource control integration. This component will require selecting and modifying an existing off-the-shelf database, such as MySQL or Postgres; it will not involve creating new database mechanisms.
** 3) Services for creating new storage services and intercepting storage system calls (1E)
Our design target for this is a robust "loopback" filesystem interface. Such interfaces exist for Linux and other operating systems. The major challenges are: integration with management, authentication, and accounting; ensuring that the interface is robust and stable; and ensuring that the interface can be used across slices to allow developers to build upon each other's services.
** 4) Raw disk interface (3E)
As above: resource integration and accounting. This mechanism may require more or less effort depending on the level of performance desired and the granularity of sharing and allocation. Options range from allocating a VM-level share of the disk to allocating large files. The complexity of this task depends on the choices available from the underlying VM.
** 5) Block-based storage interface (3E)
Not too hard, but it needs to be fast, and integrated with resource/identity.
** 6) A wide-area filesystem for administration and experiment management (4E)
We envision this being a port of an existing filesystem to the GENI framework. Suitable options include SFS, AFS, or NFSv4. The challenges again are integration with accounting and authorization.
** 7) A high-performance cluster filesystem (4E)
Existing ones are difficult. Could be based on the Google File System clones, or Lustre, or others. But we estimate 4 person-years to actually get it working and robust, with deployment automated.
** 8) Fast write-once/read-only storage services (3E)
Using things like Shark, DOT, etc. Must settle on one and make it robust. A few person-years.
** 9) A reduced complexity storage interface for constrained nodes (3E)
** 10) Maintenance and bug fixing throughout the life cycle (5E)
Update storage interfaces as hardware and system software change.
* Dependencies: Security service; Resource allocation policy
* Labor costs: 28E --> 28 engineers, 11 QA, 5 architects, 5 managers
* Downscoping: 75%. Based on the nice-to-have features being the most complex to implement.
* Confidence factor: 2X
Many elements of the storage services WBS depend on the difficulty of integration with resource accounting and authorization, which could vary by up to a factor of two.
Deliverable 7: The possibility of purchase could reduce cost, but complexity could dramatically increase cost.
Deliverable 11: This is of potentially unbounded complexity. We just don't know how to estimate the cost without additional user feedback on requirements.
* Rationale: Many of the components of the GENI storage services are off-the-shelf or nearly off-the-shelf components that must be modified to work within the GENI resource accounting and authorization framework. Where such components are usable, they are a lower-risk / lower-cost aspect of the proposed design. There is also some work that must be done to each of these components to ensure that they are automated sufficiently to scale, from an administrative perspective, to the planned number of nodes in GENI. Other parts of the services are new or re-engineered: the block-based storage interface, for instance. The logging services will need to be constructed de novo.
We removed the high-performance distributed filesystem deliverable. This task is very difficult and extremely costly (existing commercial implementations took hundreds of person-years to complete).
Data push services and logging services moved to communication.

46 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

47 Resource Allocation Goals
Define a framework for expressing policies for sharing resources among global participants
Design mechanisms to implement likely policies
Resource allocation mechanisms should:
provide resource isolation among principals
be decentralized
support federation and local site autonomy
be secure
provide proper incentives: for participants to contribute resources to the system and keep them up and running, and for participants to use only as much resource as they really need

48 Existing Resource Allocation Model
Existing model for PlanetLab resource allocation: all resources placed in a central pool; all users compete for all resources
Pros: simple, no complex policy; well understood, tried and true time sharing
Downsides:
no incentive for anyone to add additional resources
no incentive to keep local machines up and running
no incentive for anyone to use less than "as much as possible"
all best-effort: can't reserve a fixed share of a node

49 Example Allocation Policies
All resources placed in central pool [Supercomputer Center] Portion of resources reserved for dedicated use [SIRIUS] Portion of resources available for bidding [Bellagio] Pair-wise resource peering [SHARP]

50 Resource Allocation Proposal
Three pieces:
GMC runs a centralized Resource Broker (RB)
Each site runs a Site Manager (SM)
Each component (e.g., node) runs a Component Manager (CM)
A site donates some portion of its resources to GENI:
The site's SM receives a Token of value proportional to the value of the resources contributed
The SM subdivides the Token among site users
To access a resource:
The user presents a token and a resource request to the RB
The RB returns a Ticket (a lease for access to the requested resource)
The user presents the Ticket to the resource's CM to obtain a sliver
"Back door": the GENI Science Board can directly issue Tokens and Tickets to users and sites
(A sketch of this flow follows.)
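
A minimal sketch of the donate/token/ticket flow, with signatures and RSpecs omitted. The class and field names are illustrative stand-ins; the actual message formats are the XSD types on the following slides.

    import uuid

    class ResourceBroker:
        """Toy broker: donations mint tokens, tokens buy tickets."""
        def __init__(self):
            self.tokens = {}                 # token guid -> remaining value

        def accept_donation(self, site, rspec_value):
            guid = str(uuid.uuid4())
            self.tokens[guid] = rspec_value  # Site Manager receives a token worth the donation
            return guid

        def issue_ticket(self, token_guid, node, cost=1):
            if self.tokens.get(token_guid, 0) < cost:
                raise ValueError("insufficient token value")
            self.tokens[token_guid] -= cost
            return {"ticket": str(uuid.uuid4()), "node": node, "valid_for_s": 3600}

    class ComponentManager:
        def create_sliver(self, ticket):
            return {"sliver": str(uuid.uuid4()), "node": ticket["node"]}

    rb, cm = ResourceBroker(), ComponentManager()
    token = rb.accept_donation("UW", rspec_value=100)      # 1: site donates, its SM gets a token
    ticket = rb.issue_ticket(token, "plab1.cs.example")    # 2: user redeems token value for a ticket
    print(cm.create_sliver(ticket))                        # 3: CM turns the ticket into a sliver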

51 GENI Resource Allocation
Figure: four Site Managers connected to a central Resource Broker.

52 GENI Resource Allocation
Sites donate some portion of their resources to GENI (Donation messages flow from the Site Managers to the Resource Broker).

53 GENI Resource Allocation
In exchange, GENI issues Tokens to each site with value proportional to that of the donated resources.
Each token carries a value, so 10 tokens of value 1 are equivalent to 1 token of value 10.
Any principal can subdivide tokens.

54 GENI Resource Allocation
The Site Manager delegates some resource privileges to a user by issuing a Token of smaller denomination.

55 GENI Resource Allocation
The user consults any resource discovery service to locate the desired resources.

56 GENI Resource Allocation
The user presents a Token and a resource request to the Resource Broker.

57 GENI Resource Allocation
The Resource Broker (possibly after consulting the Component Managers of the requested resources) returns one Ticket for each requested resource.
A Ticket is a lease: it guarantees access to a resource for a period of time.

58 GENI Resource Allocation
The user presents each Ticket to a Component Manager and receives a Sliver (a handle to the allocated resources).

59 GENI Resource Allocation
The GENI Science Board can directly issue Tokens and Tickets to users and sites to reward particularly useful services or hardware.

60 Additional Details: Donations
<xsd:complexType name="Donation"> <xsd:sequence> <xsd:element name="GUID" type="xsd:string"/> <xsd:element name="Recipient" type="xsd:string"/> <xsd:element name="RSpec" type="tns:RSpec"> <xsd:element name="Signature" type="xsd:base64Binary"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="RSpec"> <xsd:sequence> <xsd:element name="Issuer" type="xsd:string"/> <xsd:element name="Resources" type="tns:ResourceGroup"/> <xsd:element name="IsolationPolicy" type="tns:IsolPolicy"/> <xsd:element name="AUP" type="tns:AUP“> <xsd:element name="ValidStart" type="xsd:dateTime"/> <xsd:element name="ValidEnd" type="xsd:dateTime"/> </xsd:sequence> Donation recipient is Resource Broker RSpec specifies resources donated For now, we assume at least one donation message per node being donated (since the existing GMC type ResourceGroup specifies a single node); eventually, would like some way to specify aggregates (e.g. a whole cluster) We assume RSpec specifies supported sharing policies for donates resources (guaranteed-share, best-effort proportional share, etc.) IsolPolicy and AUP are just Strings for now, since we don’t know exactly what they will look like

61 Additional Details: Tokens
<xsd:complexType name="Token"> <xsd:sequence> <xsd:element name="Issuer" type="xsd:string"/> <xsd:element name="GUID" type="xsd:string"/> <xsd:element name="Recipient" type="tns:SliceName"/> <xsd:element name="Value" type="xsd:decimal"/> <xsd:element name="ValidStart" type="xsd:dateTime"/> <xsd:element name="ValidEnd" type="xsd:dateTime"/> <xsd:element name="ParentGUID" type="xsd:string"/> <xsd:element name="Signature" type="xsd:base64Binary"/> </xsd:sequence> </xsd:complexType> Token issuer is Resource Broker, Recipient is a slice (or, in “back-door” process, Token issuer is Science Board, recipient is a slice) ParentGUID allows tokens to be chained together. For example, the token a SM gives a user will be chained to the token the RB gave to the SM (which the SM subdivided). This allows detection of SMs who manufacture more Tokens than allowed, etc. Meaning of value is not pre-ordained; it is up to the Resource Broker to determine value required for different kinds of tickets for different resources

62 Additional Details: Tickets
<xsd:complexType name="Ticket"> <xsd:sequence> <xsd:element name="GUID" type="xsd:string"/> <xsd:element name="Recipient" type="tns:SliceName"/> <xsd:element name="RSpec" type="tns:RSpec"/> <xsd:element name="ValidFor" type="xsd:duration"/> <xsd:element name="Signature" type="xsd:base64Binary"/> </xsd:sequence> </xsd:complexType> RSpec specifies resources guaranteed to user Ticket issuer is Resource Broker (or, in back-door process, Science Board) Ticket recipient is a slice ValidFor specifies how long the user can use the specified resources; allows authorization policies such as “you can use the node for any 30-minute period this week” (RSpec time spec indicates the “this week” part, while Ticket ValidFor specifies the “30-minute” part)

63 Implementing RA policies
Current PlanetLab RA policy (per-node proportional share):
The Site Manager donates nodes
The SM receives >= N*M Tokens (N = # of PL nodes, M = # of users at the site)
The SM gives each user N Tokens of value 1
The user presents one Token of value 1 and a resource request to the RB
The RB returns a Ticket authorizing proportional-share use of the requested node
The user presents the Ticket to the CM, which returns a Sliver on its node
The user's share = 1/P, where P = number of users (slivers) on the node
Weighted proportional share:
As above, but the user presents a Token of value T to the RB (T may be > 1)
The user's share = T/Q, where Q = number of Tokens redeemed by the other slivers using the node

64 Implementing RA policies (cont.)
The user wants a guaranteed share of a node's resources:
The user presents a token of value T and a resource request to the RB
The RB returns a Ticket for a guaranteed T% share of the requested node
The user presents the Ticket to the CM, which returns a Sliver on its node; the sliver is guaranteed a T% share of the node's resources
But if the RB has already committed more than (100-T)% of the node, either:
1) The RB refuses to grant the Ticket, and then (a) the user tries again later, or (b) the user tries again immediately, specifying a later starting time, or (c) an out-of-band mechanism queues the request and issues a callback to the user when T% of the resource is available
2) Or the RB grants the Ticket, setting ValidFor to the requested duration; the user presents the Ticket at any time between ValidFrom and ValidTo
In (1a) and (1b) the user tries to guess when the resource will be available; (1c) is the current Emulab batch scheduling model; (2) is a model where the user says "I will run whenever you want, but I need to know now when that will be"
The benefit of (1a) and (1b) is that the user keeps control of when they will have access; in (1c) and (2) the RB dictates to the user when access will be granted
(A sketch of this admission decision follows.)
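
The over-commitment cases above can be made concrete with a small sketch of the broker's decision: refuse when less than T% is free now (option 1), or grant a ticket whose start time the broker picks (option 2). The data structures are illustrative assumptions, not the RB design.

    def request_guaranteed_share(reservations, T, now, option=1):
        """reservations: list of (start, end, pct) shares already granted on the node."""
        def free_at(t):
            return 100 - sum(p for (s, e, p) in reservations if s <= t < e)

        if free_at(now) >= T:
            return {"granted": True, "start": now, "share": T}
        if option == 1:
            return {"granted": False}          # 1a/1b: the user retries later or names a later start
        # Option 2: the broker picks the earliest reservation end with enough room left.
        for (_, end, _) in sorted(reservations, key=lambda r: r[1]):
            if free_at(end) >= T:
                return {"granted": True, "start": end, "share": T}
        return {"granted": False}

    reservations = [(0, 100, 70), (0, 50, 20)]   # node is 90% committed until t=50, 70% until t=100
    print(request_guaranteed_share(reservations, T=25, now=10, option=1))   # refused
    print(request_guaranteed_share(reservations, T=25, now=10, option=2))   # granted, start=50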

65 Implementing RA policies (cont.)
Resource auctions:
The RB coordinates the bidding
The "cost" of using a resource is dynamically controlled by changing the "exchange rate" of Token value to Ticket share
Loan/transfer/share resources among users, brokers, or sites:
Tokens are transferrable; ParentGUID traces the delegation chain
Sites and users can give tokens to other sites or users

66 Resource Alloc Deliverables (17E)
1) Public API to set resource privileges on a per-user/per-site basis.
2) Public API to set use privileges on a per-component basis (for site admins).
3) Initial web-based interface to allow the GENI Science Council to set per-user/per-site privileges using the API in step 1.
4) Initial web-based interface to allow administrators to set policy for locally available components.
5) Refined versions of 1, 2, 3, 4 above based on user and community feedback.
6) Design of capabilities to represent GENI resources.
7) Design of tickets representing leases for access to individual GENI components.
8) Initial implementation of Resource Brokers and client software to present requests for resources and obtain the appropriate set of tickets.
9) Site administrator API and web interface to assign privileges on a per-user basis.
10) Integration with the resource discovery service.
11) Integration with the experiment management software.

67 Open Issues
Specifying resource aggregates (e.g., a cluster)
Multiple, decentralized RBs rather than a single centralized RB run by GENI
Describing more complex sharing policies
Build and deploy a real implementation: Site Manager, Resource Broker, and Component Manager as SOAP web services built on top of the existing GMC XSD specifications

68 Resource Allocation Conclusion
Goal: a flexible resource allocation framework for specifying a broad range of policies
Proposal: a centralized Resource Broker, per-site Site Managers, per-node Component Managers
Properties:
rewards sites for contributing resources, with a special back door to give users and sites bonus resources
encourages users to consume only the resources they need
allows a variety of sharing policies to be expressed
all capabilities (donations, tokens, tickets, slivers) time out, allowing resources to be garbage collected
allows dynamic valuation of users and resources
currently centralized, but the architecture allows decentralization
secure (all capabilities are signed)

69 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

70 Experimenter’s Support Toolkit
Make it easy to set up and run experiments on GENI Goal: make GENI accessible to the broadest set of researchers, including those at places with little prior institutional experience Support different types of experiments/users: Beginners vs. expert programmers Short-term experiments vs. long-running services Homogeneous deployments vs. heterogeneous deployments

71 Typical Experiment Cycle
Figure: typical experiment cycle for an application: obtain resources from the resource pool, connect to the resources, prepare the resources, start/monitor processes, clean up.
The picture will look different for long-running services, as process monitoring, resource preparation, etc. will proceed in a cycle.

72 Desired Toolkit Attributes
Support gradual refinement: Smooth implementation path from simulation to deployment Same set of tools for both emulation and real-world testing Make toolkit available in different modes Stand-alone shell Library interface Accessible from scripting languages Enable incremental integration with other services For instance, should be able to change from one content distribution tool to another by just changing a variable Sophisticated fault handling Allow experimenters to start with controlled settings and later introduce faults and performance variability Library support for common design patterns for fault-handling

73 Toolkit as an Abstraction Layer
Figure: the toolkit as an abstraction layer. Different users (entry-level users, long running services, services requiring fine-grained control) access it through a shell, a scripting language, or an API; toolkit components (experiment instantiation, job control, I/O, exceptions, debugging, transactions, support for end-hosts and heterogeneity, scalability, resource discovery) run over LAN clusters, Emulab, ModelNet, and GENI.

74 Basic Toolkit Components
System-wide parallel execution (a sketch follows):
Start processes on a collection of resources
Integrated support for suspend/resume/kill
Issue commands asynchronously
Support various forms of global synchronization (barriers, etc.)
Node configuration tools: customizing nodes, installing packages, copying executables, etc.
Integrate with monitoring sensors:
Distributed systems sensors such as slicestat and CoMon
Information planes for network performance (such as iPlane)
Integrate with other key services: content distribution systems, resource discovery systems, etc.
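
A hedged sketch of the system-wide parallel execution primitive: run the same command on a set of nodes over ssh with bounded concurrency and collect per-node exit status. The node names are placeholders; a real toolkit would also provide suspend/resume/kill, barriers, and asynchronous dispatch.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["node1.example.org", "node2.example.org", "node3.example.org"]   # placeholders

    def run_on(node, command, timeout=30):
        # One ssh invocation per node; BatchMode avoids hanging on password prompts.
        proc = subprocess.run(["ssh", "-o", "BatchMode=yes", node, command],
                              capture_output=True, text=True, timeout=timeout)
        return node, proc.returncode, proc.stdout.strip()

    def run_everywhere(command, concurrency=2):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            return list(pool.map(lambda n: run_on(n, command), NODES))

    if __name__ == "__main__":
        for node, rc, out in run_everywhere("uname -a"):
            print(f"{node}: rc={rc} {out[:60]}")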

75 Advanced Components Key stumbling block for long-running services is ensuring robustness in the presence of failures Need to provide support for incremental resource allocation Library support for common design patterns to handle faults Support for transactional operations and two-phase commits, support “execute exactly once” semantics, etc. Support for detecting abnormal program behavior, application-level callbacks, debugging, etc. Reliable delivery of control signals, reliable delivery of messages

76 Experiment Support (30E)
1) Tools for performing system-wide job control, such as executing the same command on all nodes with desired levels of concurrency, etc.
2) Tools for performing operations in an asynchronous manner and synchronizing with previously executed commands.
3) Tools for setting up necessary software packages and customizing the execution environment.
4) Tools for coordinated input-output (copying files and logs).
5) Exposing the toolkit functionality in a library API.
6) Exposing the toolkit functionality using a graphical user interface (6-8E)
7) Integration of tools into scripting languages.
8) Simple ways for users to specify desired resources.
9) Resilient delivery of control signals.
10) Transactional support for executing system-wide commands.
11) Support for detecting faults in experiments.
12) Scalable control plane infrastructure: dissemination of system-wide signals, coordinated I/O, and monitoring of program execution should all be done in a scalable manner (3-5E)
13) Interface with content distribution, resource discovery, and slice embedding systems.
14) Interface with the information plane for the communication subsystem and the various sensors monitoring the testbed.
15) Tools for checking global invariants regarding the state of a distributed experiment (4E)
16) Logging to enable distributed debugging.
17) Debugging support for single-stepping and breakpoints.
Rationale: Most of the deliverables mentioned above can be performed with about a single person-year of work. Exceptions are listed below.
Task #6 is 6-8 person-years, based on the Emulab experience, as graphical user interfaces are both critical and hard to get right the first time around.
Task #12 is 3-5 person-years, as it takes a substantial amount of software to provide a scalable infrastructure.
Task #15 is probably 3-4 person-years and could potentially be downscoped.
Maintaining code and folding in user feedback is also a substantial task and is potentially difficult to estimate.
NOTE: this task is substantially harder than in PlanetLab because the tools must interact with many different kinds of devices (access points, routers, Xen domains, etc.).
Can be downscoped: 100%
Novelty: 20%
Confidence Factor: 2x

77 Slice Embedding Deliverables (25E)
1) Resource specification language for describing a user's needs.
2) Generic matching engine.
3) Algorithms for efficient matching.
4) Matching engine for each subnet.
5) Stitching module to compose results from different subnets.
6) Integration with the resource discovery system to identify available resources.
7) Integration with the resource allocation system to ensure allocation.
RATIONALE: The initial version of the API could provide basic functionality: just support allocation of machines/compute time, rather than more complex resources such as wireless frequency, and use simple algorithms for performing the matching even if they are expensive. The initial rollout should, however, have the module interface with resource allocation and resource discovery. I am estimating 5 person-years for this task. Subsequent work can focus on developing subnet-specific matching engines and resource specifications, accounting for about 10 person-years. Finally, I am budgeting 5 person-years for developing scalable algorithms that can work even when the user specifies complex constraints on the resources desired, and 5 more person-years for folding in community feedback and maintaining the software. The work on better algorithms could be down-scoped. Interestingly enough, better algorithms and a rich resource specification framework account for the novelty in this work. This is a service critical to ensuring smooth operation of GENI.
Can be downscoped: 100%
Novelty: 40%
Confidence Factor: 3x

78 Topics Security architecture Edge cluster hardware/software definition
Storage services Resource allocation Experiment support Operations support Communications substrate Legacy Internet applications support

79 Monitoring
Goals:
Reduce the cost of running the system through automation
Provide a mechanism for collecting data on the operation of the system
Allow users to oversee experiments
Infrastructure (i.e., node selection, slice embedding, history, etc.)
History:
Clusters, Grid: Ganglia
PlanetLab: CoMon, Trumpet, SWORD
Metrics:
Node-centric: CPU, disk, memory, top consumers
Project-centric: summary statistics (preserves privacy)
Triggers (a sketch follows):
Node or project activity "out of bounds"
Warning messages, actuators
Combinations with experiment profiles
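
A small sketch of the "out of bounds" trigger idea: compare node-centric metrics against an experiment profile and emit a warning (or call an actuator) when a bound is exceeded. The metric names and bounds are made up for illustration.

    # Hypothetical experiment profile: upper bound per metric.
    PROFILE = {"cpu_pct": 80, "disk_gb": 50, "tx_mbps": 100}

    def check_node(node, metrics, actuate=None):
        alerts = []
        for name, bound in PROFILE.items():
            value = metrics.get(name, 0)
            if value > bound:
                alerts.append(f"{node}: {name}={value} exceeds bound {bound}")
        for alert in alerts:
            print("WARNING:", alert)
            if actuate:
                actuate(node)          # e.g., throttle or suspend the offending sliver
        return alerts

    check_node("plab1.cs.example", {"cpu_pct": 95, "disk_gb": 12, "tx_mbps": 240})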

80 Operations Support Issues
Two categories of support systems Online: monitor the function and performance of GENI components in real-time Use the ITU FCAPS model to classify necessary support systems Offline: problem tracking, maintenance requests, and inventory Build or buy decisions First preference is to use open-source if available, appropriate, and competitive Develop re-distributable extensions as appropriate Second preference is to purchase COTS software Evaluate cost per seat, educational discounts, and impact of restricted access to system data Last choice is to build systems from scratch if no suitable alternatives exist

81 FCAPS (Fault, Configuration, Accounting, Performance, Security)
Fault management Detect and track component faults in running system Initiate and track the repair process Example systems: Nagios, HP OpenView, Micromuse Netcool Configuration management Automate and verify introduction of new GENI nodes Provision and configure new network links Track GENI hardware inventory across sites Examples: PlanetLab boot CD, Telcordia Granite Inventory, Amdocs Cramer Inventory, MetaSolv Accounting Manage user and administrator access to GENI resources Map accounts to real people and institutions Examples: PlanetLab Central, Grid Account Management Architecture (GAMA)

82 FCAPS (Fault, Configuration, Accounting, Performance, Security)
Performance management Fine-grained tracking of resource usage Queryable by administrators and adaptive experiments Detecting and mitigating transient system overloads and/or slices operating outside their resource profiles Examples: CoMon, HP OpenView, Micromuse Netcool Security management Log all security-related decisions in an auditable trail Viewable by cognizant researcher and operations staff Monitor compliance with Acceptable Use Policy Try to detect certain classes of attacks before they can cause significant damage Examples: Intrusion detectors, compliance systems, etc.

83 Problem Tracking All researcher/external trouble reports, plus any traffic incident reporting Examples: this filesystem seems corrupt, this API does not seem to match the behavior I expect, or “why did I receive this traffic?” Receive alerts/alarms from platform monitoring system (e.g., Nagios, OpenView, etc.) Track all reported alarms, delegate to responsible parties, escalate as needed Classify severity, prioritize development/repair effort Examples: Request Tracker (RT), Bugzilla, IBM/Rational ClearQuest

84 Operations Support (31E)
1) GENI Fault Management System software (4E)
2) GENI Configuration Management System software (4E)
3) GENI Accounting Management System software (2E)
4) GENI Performance Management System software (3E)
5) GENI Security Management System software (2E)
6) GENI Problem Tracking System software (2E)
7) GENI Community Forum software (2E)
8) Lifecycle management of all software components (12E)
Novelty: 25% of the work will require innovation. While many of the Operations Portal functions are reasonably well understood for conventional networks, the programmability, virtualization, and scale inherent in GENI will require new ideas in operations. Fault and performance management for GENI will require innovation; in particular, new research will be needed to determine the best way of performing root-cause analysis of faults in a system like GENI.
Confidence Factor: 3X
Rationale: We assume that the GENI Fault, Accounting, Performance, and Security Management systems can be derived from existing open-source tools that provide similar functions in other environments. The GENI Configuration Management System may need to be built from scratch or derived from commercial software. The Problem Tracking System and GENI Community Forum can likely be derived with little modification from existing open-source software that performs similar functions.
Deliverable 8: We assume 4E per year of recurring costs for three years to maintain and upgrade software components.

85 Communication Substrate
Bulk data transfer
Small message dissemination (e.g., application-level multicast) for control messages
Log/sensor data collection
Information plane to provide topology information about both GENI and the legacy Internet
Secure control plane service running on GENI, so that device control messages traverse the facility itself and therefore cannot be disrupted by legacy Internet traffic; essential if the facility is to be highly available

86 Communication Deliverables (11E)
1. Bulk data transfer (e.g., CoBlitz or Bullet), to load experiment code onto a distributed set of machines (3E)
2. Small message dissemination (e.g., application-level multicast) for control messages to a distributed set of machines (1E)
3. Log/sensor data collection, from a distributed set of machines to a central repository/analysis engine (3E)
4. Information plane to provide topology information about GENI and the legacy Internet (1E)
5. Software maintenance and upgrades (3E)
Downscoping: 100%. If these items are left off, the downside will be lower efficiency in utilizing resources, but the system will continue to function.
Novelty: 20%. Depending on the final requirements, the log collection service will require some thinking to make it work well.
Confidence factor: 2x
Rationale: For budget reasons, we eliminated a deliverable, the secure control plane service. This service uses GENI resources, so that device control messages traverse the facility itself and therefore cannot be disrupted by legacy Internet DoS attacks. Since this service can be built as a service running on GENI, its absence does not prevent an experimenter who needs it from building it themselves.

87 Legacy Services
Virtualized HTTP (2E): allow experiments to share port 80 (a sketch follows)
Virtualized DNS (2E): allow experiments to share port 53
Client opt-in (12E): assumes Symbian, Vista, XP, MacOS, Linux, WinCE
Distributed dynamic NAT (2E): connections return to the source
Virtualized BGP (backbone group)
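
One plausible way to let experiments share port 80, sketched below, is a front end that reads the Host header and relays the connection to a per-experiment backend port. The vhost mapping and ports are assumptions, binding to port 80 typically requires privileges, and the real deliverable would also need per-slice accounting and hardening.

    import socket, threading

    # Hypothetical mapping from virtual host name to an experiment's backend port.
    VHOSTS = {"expt1.geni.example": 8081, "expt2.geni.example": 8082}

    def relay(a, b):
        while (data := a.recv(4096)):
            b.sendall(data)

    def handle(client):
        head = client.recv(8192)                     # simplistic: assume headers arrive in one recv
        host = next((line.split(b":", 1)[1].strip().decode()
                     for line in head.split(b"\r\n") if line.lower().startswith(b"host:")), None)
        port = VHOSTS.get(host)
        if port is None:
            client.sendall(b"HTTP/1.1 404 Not Found\r\n\r\n"); client.close(); return
        backend = socket.create_connection(("127.0.0.1", port))
        backend.sendall(head)                        # forward the bytes already read
        threading.Thread(target=relay, args=(client, backend), daemon=True).start()
        relay(backend, client)

    srv = socket.socket(); srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 80)); srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()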

88 Prioritization
High priority ("can't live without it"): 60E
Data transfer to set up experiments
Local storage
Resource allocation/mgmt
Operations support
Medium priority ("should have"): 50E
Legacy services
Quick and dirty experiment support
Efficient disk storage
Log data collection
Nice to have: 40E
Information plane
Simplified file system interfaces for execution
More functional workbench
Slice embedding
Notes:
Security and edge cluster are prioritized elsewhere (part of GMC)
Prioritization is also needed within each functional area (e.g., ops support)
Procedurally, we estimated the budget numbers in the following way. We extracted from the various GENI documents that the working group had written the list of concrete deliverables for each portion of the services effort. Mic Bowman and I came to a consensus estimate of the number of person-years of *developer* effort that would be required to develop each piece of functionality; that is the #E value next to each deliverable. You should realize that number is incredibly squishy, as we often had little more than a one-paragraph description of the deliverable, and there is an order of magnitude difference in cost that is possible for a given component based on how thoroughly it is engineered, how complex the API is, etc. We were somewhat, but I think not overly, conservative in estimating the amount of developer time. One reason for being conservative is that at the time we did the estimates, there was no process in place for tracking requirements/interfaces, and therefore we had little assurance we could bound the complexity of each piece (requirements might be tacked on that would drive up the amount of effort). For example, the project had made no effort to minimize the number of unique pieces of hardware that the software would need to be ported to, which would obviously drive up the cost (at the very least, of the testing effort). Similarly, if we were shooting for a bare-bones workable system, the costs would go down significantly. However, we had no way of knowing which world we were operating in, and I think that is still true.
We then took the developer effort and did a mechanical expansion to determine the final budget number. To each developer-year, we added 0.4 tester-years, 0.2 manager-years, and 0.2 architect (technical lead) years. This is based on Mic's and my estimate of standard practice in industry; academic efforts like PlanetLab use fewer testers and architects/technical leads (so Larry's numbers for these are different), and whether that is a good or a dumb idea is open to debate. There is another factor of 2 added in the spreadsheet due to overhead. We then somewhat arbitrarily assigned a factor of 2 uncertainty (in reality, we could have defended 10x uncertainty given the level of specificity we have). In some of the text portions there is some discussion as to which deliverables have the greatest technical risk, but that is again only a ballpark estimate, and in most cases we kept the 2x factor regardless, as the minimal defensible uncertainty (that is, there exists a process for which that would be the uncertainty).
I believe it would be possible to design a useful system for the budgeted time/effort; whether it is less or more is a question of priorities and of having a rational process for managing the list of required features. In order to get a much closer number, we would need a lot more detail about each component.

89 Conclusion
We understand most of what is needed to build security and user support into GENI
Lots of work still to do to refine the design
Comments welcome; the design is not intended as a fixed point


Download ppt "GENI Distributed Services Preliminary Requirements and Design"

Similar presentations


Ads by Google