Presentation on theme: "GENI Distributed Services Preliminary Requirements and Design"— Presentation transcript:
1 GENI Distributed Services Preliminary Requirements and Design Tom Anderson and Amin Vahdat (co-chairs) David Andersen, Mic Bowman, Frans Kaashoek, Arvind Krishnamurthy, Yoshi Kohno, Rick McGeer, Vivek Pai, Mark Segal, Mike Reiter, Mothy Roscoe, Ion Stoica
2 Distributed Services Work Status Work split into subgroups:Security Architecture (Mike Reiter, Tom Anderson, Yoshi Kohno, Arvind Krishnamurthy)Edge cluster definition (Mic Bowman, Tom Anderson)Storage (Frans Kaashoek, Dave Andersen, Mic Bowman)Resource Allocation (Amin Vahdat, Rick McGeer)Experiment Support (Amin Vahdat, Arvind Krishnamurthy)Operations Support (Mark Segal, Vivek Pai)Communications Substrate (Arvind Krishnamurthy, Amin Vahdat, Tom Anderson)Legacy Systems (Tom Anderson)
3 Distributed Services Work Status Each section progressed against a defined sequence:OverviewRequirements descriptionPreliminary designRelated work discussionModules and dependencies identifiedWBS estimateEvery part of the design subject to change as science goals are refined, additional information gatheredIncluding during construction
4 Distributed Services Work Status Overall state:Rationale/design needs better documentation and an independent reviewModules identified/initial WBS estimates completedNeed clarity from the GSC as to prioritizationSpecificsSecurity: Design solid; user scenarios neededEdge Node/Cluster: Requirements in flux depending on budget issues; moved to GMCStorage: Requirements solid; modules identifiedResource allocation: Design solid; user scenarios neededExperimenter support: User experience needed to drive requirementsOperations support: Requirements outlinedCommunication Services: Requirements outlinedLegacy Support: Requirements outlined
5 Facility Software Architecture achieve system-wide properties such as security, reproducibility, ..Distributed Servicesname space for users, slices, & componentsset of interfaces (“plug in” new components)support for federation (“plug in” new partners)GMCprovide ability to virtualize and isolate components in a way meaningful to exptsSubstrate Components
6 Facility Software Architecture At hardware device level, component manager, virtualization and isolation layerMinimal layer (GENI management core) to provide basic building blocksRobustness of this layer is critical to the entire project, so keep it small, simple, well-definedAvoid “big bang” integration effortServices layered on top of GMC to provide system-level requirementsModular to allow independent development and evolutionAs technology progresses, post-GENI efforts can replace these services piece by piece
7 User Centric View Researchers Operations staff Ease of describing, launching, and managing experimentsNetwork-level, not node-levelOperations staffAdministrative cost of managing the facilityResource owners (hardware contributors)Policy knobs to express priorities, security policy for the facilitySystem developers (software contributors)GENI developers and the broader research community building tools that enhance GENIEnd usersResearchers and the publicGoal of distributed services group is to make the system more useful, not more powerful
8 Principal Concerns Security and isolation Operational cost and manageabilityUsability and experiment flexibilityScalability, robustness, performanceExperiment development costConstruction cost and schedulePolicy neutrality: avoid binding policy decisions into GENI architecture
9 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
10 Security Architecture What is the threat model?What are the goals/requirements?Access controlAuthentication and key managementAuditingOperator/administrative interfaces
11 Threat model Exploitation of a slice Exploitation of GENI itself Runaway experimentsUnwanted Internet trafficExhausting disk spaceMisuse of experimental service by end usersE.g., to traffic in illegal contentCorruption of a sliceVia theft of experimenter’s credentials or compromise of slice softwareExploitation of GENI itselfCompromise of host O/SDoS or compromise of GENI management infr
12 Requirements: Do no harm Explicit delegations of authorityNode owner GMC Researcher students …Least privilegeGoes a long way toward confining rogue activitiesRevocationKeys and systems will be compromisedAuditabilityScalability/PerformanceAutonomy/Federation/Policy NeutralityControl ultimately rests with node owners, can delegate selected rights to GMC
13 Modeling Access Control in Logic Expressing Beliefs:Bob says FIt can be inferred that Bob believes that F is trueBob signed FBob states (cryptographically) that he believes that F is trueTypes of Beliefs:Bob says open(resource, nonce)Bob wishes to access a resourceBob says (Alice speaksfor Bob)Bob wishes to delegate all authority to AliceBob says delegate(Bob, Alice, resource)Bob wishes to delegate authority over a specific resource to AliceInference Rules (examples):A says (B speaksfor A)B says FA signed Fspeaksfor-esays-iA says FA says FProofs: Sequence of inference rules applied to beliefs
14 Traditional Access Control Lists Part of the TCB.Received in the request.Scott speaksfor Mike.Students??Scott signed open(D208)delegate(Mike,Mike.Students, D208)?Mike.Students saysopen(D208)?Note: not signedMike says open(D208)Stored in the reference monitor.Part of the TCB.
15 A “Proof Carrying” Approach Received in the request.Mike signed (Scottspeaksfor Mike.Students)??Scott signed open(D208)Mike signed delegate(Mike,Mike.Students, D208)?Mike.Students saysopen(D208)?Mike says open(D208)Stored in the reference monitor.Part of the TCB.
16 Authorization Example (simplified) Slivers5) You can authorize X to send to GENI nodesStudentsendMachine XResourcemonitor4) You can authorize X to send to GENI nodesX says send?Local administrator1) Delegate: all authorityProfessorUniversity 1University 22) You can authorize X to send to GENI nodes3) You can authorize X to send to GENI nodesGENI Management Central
17 Authentication and key management GENI will have a PKIEvery principal has a public/private keyE.g., users, administrators, nodesCertified by local administratorKeys sign certificates to make statements in formal logic (identity, groups, authorization, delegation, …)Private key compromise an issueEncrypted with user’s password? Off-line attacksSmart card/dongle? Most secure, but less usableCapture-resilient protocols: A middle ground
18 Capture-Resilience Properties ?ServerServer…, 3, 2, 1Attacker must succeed in online dictionary attackAttacker gains no advantage?ServerServer…, 3, 2, 1Attacker must succeed in offline dictionary attackAttacker can forge only until server is disabled for device
19 Delegation in Capture-Protection AuthorizeAuthorizeRevokeAuthorize
20 Intrusion DetectionTraditional intrusion detection methods may not suffice for monitoring experimentsMisuse detectionSpecify bad behavior and watch for it(Learning-based) Anomaly detectionLearn “normal” behavior and watch for exceptionsBadBadNormalNormalGoodGoodProblem: Experiments do lots of things that look “bad”Problem: Experiments may be too short-lived or ill-behaved to establish “normal” baseline
21 Intrusion DetectionSpecification-based intrusion detection is more appropriate for monitoring experimentsFits in naturally with authorization framework, as wellSpecification-based intrusion detectionSpecify good behavior and watch for violationsBadNormalGood
22 Audit Log Example: PlanetFlow PlanetFlow: logs packet headers sent and received from each node to InternetEnables operations staff to trace complaints back to originating sliceNotify experimenter; in an emergency, suspend sliceAll access control decisions can be logged and analyzed post-hocTo understand why a request was granted (e.g., to give attacker permission to create a sliver)To detect brute force attacks
24 Performance Straightforward approach Modifications 10 Gbps? 2.5% of CPU; < 1% of bandwidthModificationsGroup sessions in kernelLazily add to databaseEliminate intra-GENI trafficLimit senders if auditing too expensive10 Gbps?Large flows easy, small flows even realistic?
25 Security Deliverables (21E) Definition of certificate format and semantics (2E)Certificate mgmt svc (construction, storage, lookup and caching) (5E)Access control guard (resource monitor) (2E)Security policy language and certificate revocation, and UI (3E)Secure and reliable time service (purchase)Proof generator (2E)Specification-based intrusion detection service (5E)Capture protection server and client software (2E)#E represents estimate in developer-years, assuming a five year construction span, excluding management, system test, overhead, and risk factor* WBS Number: * WBS Task Name: Security service* Description: * Deliverables: ** 1) Definition of certificate format and semantics (2E)** 2) Certificate mgmt svc (construction, storage, lookup and caching) (5E)** 3) Access control guard (resource monitor) (2E)The security resource monitor is a software component that confirms thata request to access a resource is consistent with access-control policy.It does so by verifying a logical proof of this assertion that derivesfrom digitally signed credentials. It does not construct this proofitself, but rather validates a proof provided with the request. Notethat this component is central to security in the facility.** 4) Security policy language and certificate revocation, and UI (3E)This tool enables a user to configure security policy. Configuringpolicy results in the creation of credentials, digitally signed by theuser's private key. This tool can be used by GENI administrators andresearchers alike.** 5) Secure and reliable time serviceWe should be able to purchase/license this from someone.** 6) Proof generator (2E)This component generates a proof demonstrating that a request iscompliant with the access-control policy dictated by the securityresource monitor ruling on the request.** 7) Specification-based intrusion detection service (5E)** 8) Capture protection server and client software (2E)The client and server software implement capture-protected cryptographickeys, which resist misuse even if captured.* Dependencies: Storage system (for storing certificates)... expects transactional semanticsCommunication system* People-years:21E --> 21 Engineers 8 QA 4 Managers 4 Architects* Non-labor costs:No non-labor costs.Development systems in overhead. Testbed could be included.* Downscoping: 30%* Novelty: 30%* Confidence: 2x* Rationale: ** Deliverable #3: This component implements two types of functions: (i) itprovides to clients a formal description of the access-control policythat governs the request a client plans to make, together with a nonceidentifier; and (ii) it verifies the proof submitted with the client'srequest to make sure that it is a valid proof of the requiredaccess-control policy, that it derives from credentials bearing validdigital signatures, and that the proof goal contains a recently-issuednonce identifier that has not already been received in a previous proofgoal (and hence that the proof is not replayed). For a person familiarwith the state-of-the-art in logics and proof checking, the estimateshould be conservative. That said, since this component is central toGENI security, a greater degree of assurance should be applied to itsanalysis and testing.** Deliverable #4: The most challenging aspect of this tool is the user interface. At ahigh level, it must enable a user to manage a namespace that can includenames of individuals (e.g., "Tom") and groups (e.g., "students").Managing this namespace involves conveying authority from one identifierto another, where an identifier can be either a name in the namespace ora public key. So, for example, this tool enables a person to specifythat a public key carries the authority of (or "speaks for") a name(e.g., "Tom"). It also enables one name (e.g., "Tom") to carry theauthority of another (e.g., a group called "students"), for example.The resulting credentials created by the tool are expressed in formallogic, for use in proving (WBS YYY) and checking proofs (WBS )of access-policy compliance.** Deliverable #6: This component takes as input a policy goal to prove, which itselfincludes a nonce identifier. This goal is issued, e.g., by the securityresource monitor guarding the resource to which access is to berequested. This component generates a proof that the access complieswith the goal policy, if in fact it does. This proof derives fromdigitally signed credentials that can be provided as input to thiscomponent, or that this component is in charge of retrieving fromelsewhere. The former option, in which all requisite credentials areprovided as input, is the "downscoped" version of this component and issomewhat simpler than the full version implementing credential (orlemma) retrieval from repositories. However, it should be noted that inits downscoped version, this component will need to make use ofadditional external functionality to ensure that credentials areavailable when needed to create proofs. This may take the form ofapplication-specific logic that "pushes" credentials where they arelikely to be needed, for example.** Deliverable #8: The client and server algorithms are specified in existing documents andare well understood. Implementations have also been demonstrated.Timing attacks on the server algorithm should be able to be remediedwith known techniques.
26 Security: Open Issues DoS-resistant GENI control plane? Initial control plane will employ IP and inherit the DoS vulnerabilities thereofGENI experimentation may demonstrate a control plane that is more resistantDesign proof-carrying certificates to operate independently of communication channelPrivacy of operational data in GENI?Operational procedures and practicesCentral to security of the facility
27 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
28 Example Substrate Internet GENI Backbone Site BGENIBackboneSite ASuburban Hybrid Access NetworkSensor NetPENUrban GridAccess NetworkPAPPAP: Programmable Access PointPEN: Programmable Edge NodePEC: Programmable Edge ClusterPCN: Programmable Core NodeGGW: GENI GatewayGGWPECPCN
29 Programmable Edge Cluster: HW Capabilities should be driven by science planDraft:200 sites, cluster of PCs at eachWorkhorse nodes: running experiments, emulating higher speed routers, distributed servicesMulticore CPU, 8GB of DRAM, 1TB disk, gigEHigh speed switch and router connecting to rest of GENICut in latest iteration of draft plan:20 sites, cluster of 200 PCs eachCompute/storage intensive applications
30 Programmable Edge Cluster: SW RPC over TCP/IPGENI ControlVServer SliverVServer SliverStock Linux vserver interfaceGENI ControlVM SliverVM SliverVServerKernelStock VMM interface e.g. PVILow-level VMM (e.g., Xen)Experiments run as a vserver sliver or as a VM sliverCommunicate with GENI management code (running as sliver) through RPC
31 Execution environments PlanetLab-like best-effort VServersFixed kernel, convenient APIWeak isolation between sliversWeaker security for critical componentsSmall number of standard configurationsminimal, maximal, expectedVirtual machine monitors (e.g., Xen, VMWare)Choice of prepackaged or custom kernels (as in Emulab)Linux + clickOthers possible: singularity, windows, raw clickStronger resource guarantees/isolationpoor I/O performancelimited number of VMs (scalability)
32 Service LocationServices can be implemented at any of a set of levels:Inside VMMif kernel changes are requirede.g., to implement fast segment read/write to diskIn its own VM on the VMMTo configure VMM, or if security is neededE.g., the GENI component manager; GENI security monitorIn the linux vserverIf linux kernel changes are needed, e.g., traffic monitoringIn its own vserver sliverRunning as a best effort service, e.g., vserver component managerIn a library linked with experimentE.g., database, cluster file I/O
33 Booting To boot a node To boot a sliver Trusted computing hardware on each nodeSecure boot fetches initial system softwareInitial machine state eventually comprises:Virtual Machine Monitor (e.g. Xen)Initial domain: GENI Domain (GD).Possibly VServer kernel by defaultTo boot a sliverSend authorized request to GENI DomainGD verifies request; creates new xen/vserver domainLoads software that contains sliver secure boot (GENI auth code, ssh server, etc.)See reference component design document for details
34 Containment & Auditing Limits placed on slice “reach”restricted to slice and GENI componentsrestricted to GENI sitesallowed to compose with other slicesallowed to interoperate with legacy InternetLimits on resources consumed by slicescycles, bandwidth, disk, memoryrate of particular packet types, unique addrs per secondMistakes (and abuse) will still happenauditing will be essentialnetwork activity slice responsible user(s)
35 Edge Cluster WBS Deliverables See GMC specification
36 Open Questions Resource allocation primitives on each node Reservation model:% CPU in each a given time period?Strict priorities?What about experiments/services whose load is externally driven (e.g., a virtual ISP)?Other resources with contention: memory, diskFine-grained time-slicing of disk head with real time guarantees is unlikely to work as intendedEither best effort, or disk head per application (means we need at least k+1 disk heads for k disk intensive applications)How are service-specific resources represented (e.g., segment store)?How are resources assigned to services? Through experiments giving them resources explicitly, or via configuration?
37 More Open QuestionsKernel changes needed in xen, vservers to implement resource modelCustom GENI OS to run on xen?Allocation of IP address/port space to sliverswell known portsEfficient vserver sliver creationConfigure new sliver (e.g., to run a minimal script) with minimal I/O overhead, minimal CPU timeCan/should vservers run diskless?Make it easy to share file systems read onlyVserver image provided by symlinks or NFS loopback mounts?
38 More Open Questions What is the agreement with hosting sites? Rack spaceIP address space (\24 per site?)Direct connectivity to InternetBGP peering?Bandwidth to Internet?Local administrative presence?Ability to add resources under local control?Absence of filtering/NATs
39 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
40 Storage for GENIEnables future network applications, which integrate storage, computation, and communicationLarge-scale sensor networksDigital libraries that store all human knowledgeNear-on-demand TVExperiments also need storage:Experiment resultsLogs (e.g., complete packet traces)Huge data sets (e.g., all data in Web)Running the experiment (binaries, the slice data, etc.)Managing GENI requires storage:ConfigurationSecurity and audit loggingStorage will be distributed and shared
41 Storage GoalsEnable experiments that integrate computation and storageProvide sufficient storage (e.g., 200 Petabytes)Provide convenient access to the storage resourcesProvide high performance I/O for experimentsAllow the storage services to evolveAllow experimenters to build new storage servicesBalance expectation of durabilityPermit effective sharing of storage resourcesUser authentication and access controlResource control
42 Overall Storage Design Node-level storage building blocksHigher level distributed storage abstractionsDependencies on authorization, resource managementNodeClusterWide-area storage services
43 Node-level Storage Support Convenient access for experimenters & adminsFile system interfaceSQL database on node (likely)Needed by many apps + GENI managerse.g., auditing systemExtensible access for service creatorsRaw disk / block / extent storeDirect access for building services“Loopback” filesystem supportFacilitate creating distributed storage servicesEfficient use of disk bandwidth
44 Distributed Storage Support Consolidate frequently used data management servicesConvenient administration and experimentationTransparent wide-area file systemData push services: install data “X” on 200 nodesLog storage and collection for research and managementHigh performance distributed I/Oe.g., an improved Google File System (cluster)ok to compromise on application-level transparencyPossible high-peformance wide-area filesystemWrite-once, global high-peformance storageStorage for constrained nodes (e.g., sensors)
45 Storage Deliverables (28E) Local filesystem interface (1E)SQL database (1E)Services for creating new storage services and intercepting storage system calls (1E)Raw disk interface (3E)Block-based storage interface (3E)A wide-area filesystem for administration and experiment management (4E)A high-performance cluster filesystem (4E)Fast write-once/read only storage services. (3E)A reduced complexity storage interface for constrained nodes. (3E)Maintenance and bug fixing throughout life cycle. (5E)* WBS Number: * WBS Task Name:Storage Services* Description:Create the software and hardware infrastructure to allow administrators,the developers of storage-oriented services, and researchers access tostorage resources on GENI. This infrastructure must meet the differentneeds of several groups of users, from access to highly reliable storagefor administrative purposes, to extremely fast storage forhigh-performance experiments, to providing convenient access to remoteresources to facilitate experimentation.* Deliverables:** 1) Local filesystem interface. (1E)The only difficulty here is that it must support the resourceaccounting and authorization model. Assuming that the host OS has beenmodified already, this task should be a simple integration with existingquota mechanisms, depending on the richness of the sharing and neededgranularity of resource accounting.** 2) SQL database (1E)The challenges are as above: resource control integration. Thiscomponent will require selecting and modifying an existing off-the-shelfdatabase, such as mysql or postgres; it will not involve creating newdatabase mechanisms.** 3) Services for creating new storage services and intercepting storage system calls (1E)Our design target for this is a robust "loopback" filesystem interface.Such interfaces exist for Linux and other operating systems. The majorchallenges are: - Integration with management, authentication, and accounting - Ensuring that the interface is robust and stable - Ensuring that the interface can be used across slices to allow developers to build upon each others' services.** 4) Raw disk interface (3E)As above: resource integration and accounting. This mechanism mayrequire more or less effort depending on the level of performancedesired and the granularity of sharing and allocation. Options rangefrom allocating a VM-level share of the disk to allocating large files.The complexity of this task depends on the choices available from theunderlying VM.** 5) Block-based storage interface (3E)Not too hard, but it needs to be fast, and integrated withresource/identity.** 6) A wide-area filesystem for administration and experiment management (4E)We envision this being a port of an existing filesystem to the GENIframework. Suitable options include SFS, AFS, or NFSv4. The challengesagain are integration with accounting and authorization.** 7) A high-performance cluster filesystem (4E)Existing ones are difficult. Could base on the google filesystemclones, or Lustre, or others. But estimate 4 PYs to actually get itworking and robust and the deployment of such things automated.** 8) Fast write-once/read only storage services. (3E)Using things like Shark, DOT, etc. Must settle on one, make it robust.A few person-years.** 9) A reduced complexity storage interface for constrained nodes. (3E)** 10) Maintenance and bug fixing throughout life cycle. (5E)Update storage interfaces as hardware and system software changes.* Dependencies:Security serviceResource allocation policy* Labor costs28 engineers --> 28 engineers 11 QA 5 architects 5 managers* Downscoping:75% Based on nice-to-have features are most complex implementation.* Confidence factor:2XMany elements of the storage services WBS depend on the difficuly ofintegration with resource accounting and authorization, which could varyby up to a factor of two.Deliverable 7: Possibility to purchase could reduce cost, but complexitycould dramatically increase cost.Deliverable 11: This is potentially unbound complexity. We just don'tknow how to estimate cost without additional user feedback onrequirements.* Rationale:Many of the components of the GENI storage services are off-the-shelf ornearly off-the-shelf components that must be modified to work within theGENI resource accounting and authorization framework. When suchcomponents are usable, they are a lower-risk / lower-cost aspect of theproposed design. There is also some work that must be done to each ofthese components to ensure that they are automated sufficiently toscale, from an administrative perspective, to the planned number ofnodes in GENI. Other parts of the services are new or re-engineered:the block-based storage interface, for instance. The logging serviceswill need to be constructed de novo.We removed the high-performance distributed filesystem deliverable. Thistask is very difficult and extremely costly (existing commercialimplementations took 100's of man years to complete)Data push services and logging services moved to communication.
46 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
47 Resource Allocation Goals Define framework for expressing policies for sharing resources among global participantsDesign mechanisms to implement likely policiesResource allocation mechanisms shouldprovide resource isolation among principlesbe decentralizedsupport federation and local site autonomybe secureprovide proper incentivesincentive for participants to contribute resources to the system and to keep them up and runningincentive for participants to use only as much resources as they really need
48 Existing Resource Allocation Model Existing model for PlanetLab resource allocationall resources placed in a central poolall users compete for all resourcesPros:simple, no complex policywell understood, tried and true time sharingDownsides:no incentive for anyone to add additional resourcesno incentive to keep local machines up and runningno incentive for anyone to use less than “as much as possible”all best-effort—can’t reserve fixed share of a node
49 Example Allocation Policies All resources placed in central pool [Supercomputer Center]Portion of resources reserved for dedicated use [SIRIUS]Portion of resources available for bidding [Bellagio]Pair-wise resource peering [SHARP]
50 Resource Allocation Proposal Three piecesGMC runs a centralized Resource Broker (RB)Each site runs a Site Manager (SM)Each component (e.g. node) runs a Component Manager (CM)Site donates some portion of its resources to GENISite’s SM receives a Token of value proportional to value of resources contributedSM subdivides Token among site usersTo access a resourceUser presents token + resource request to RBRB returns Ticket (a lease for access to requested resource)User presents Ticket to resource’s CM to obtain sliver“Back door”: GENI Science Board can directly issue Tokens and Tickets to users and sites
51 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite Manager
52 GENI Resource Allocation Site ManagerSite ManagerResource BrokerDonationDonationSite ManagerSite ManagerSites donate some portion of resources to GENI
53 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite ManagerIn exchange, GENI issues Tokens to each site with value proportional to that of donated resourceseach token carries a value, so 10 tokens of value 1 are equivalent to 1 token of value 10any principal can subdivide tokens
54 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite ManagerUserSite Manager delegates some resource privileges to user by issuing a Token of smaller denomination
55 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite ManagerUserResource Discovery ServiceUser consults any resource discovery service to locate desired resources
56 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite ManagerUserUser presents Token and resource request to Resource Broker
57 GENI Resource Allocation Site ManagerSite ManagerResource BrokerSite ManagerSite ManagerUserResource Broker (possibly after consulting with Component Managers on requested resources) returns one Ticket for each requested resourceTicket is a lease: guarantees access to resource for period of time
58 GENI Resource Allocation Site ManagerGENISite ManagerResource BrokerCMCMCMCMCMCMSite ManagerSite ManagerUserCMCMCMCMCMCMUser presents each Ticket to a Component Manager, receives Sliver (handle to allocated resources)
59 GENI Resource Allocation Site ManagerSite ManagerScience BoardSite ManagerSite ManagerGENI Science Board can directly issue Tokens and Tickets to users and sites to reward particularly useful services or hardware
60 Additional Details: Donations <xsd:complexType name="Donation"> <xsd:sequence> <xsd:element name="GUID" type="xsd:string"/> <xsd:element name="Recipient" type="xsd:string"/> <xsd:element name="RSpec" type="tns:RSpec"> <xsd:element name="Signature" type="xsd:base64Binary"/> </xsd:sequence></xsd:complexType><xsd:complexType name="RSpec"> <xsd:sequence> <xsd:element name="Issuer" type="xsd:string"/> <xsd:element name="Resources" type="tns:ResourceGroup"/> <xsd:element name="IsolationPolicy" type="tns:IsolPolicy"/> <xsd:element name="AUP" type="tns:AUP“> <xsd:element name="ValidStart" type="xsd:dateTime"/> <xsd:element name="ValidEnd" type="xsd:dateTime"/> </xsd:sequence>Donation recipient is Resource BrokerRSpec specifies resources donatedFor now, we assume at least one donation message per node being donated (since the existing GMC type ResourceGroup specifies a single node); eventually, would like some way to specify aggregates (e.g. a whole cluster)We assume RSpec specifies supported sharing policies for donates resources (guaranteed-share, best-effort proportional share, etc.)IsolPolicy and AUP are just Strings for now, since we don’t know exactly what they will look like
61 Additional Details: Tokens <xsd:complexType name="Token"><xsd:sequence><xsd:element name="Issuer" type="xsd:string"/><xsd:element name="GUID" type="xsd:string"/><xsd:element name="Recipient" type="tns:SliceName"/><xsd:element name="Value" type="xsd:decimal"/><xsd:element name="ValidStart" type="xsd:dateTime"/><xsd:element name="ValidEnd" type="xsd:dateTime"/><xsd:element name="ParentGUID" type="xsd:string"/><xsd:element name="Signature" type="xsd:base64Binary"/></xsd:sequence></xsd:complexType>Token issuer is Resource Broker, Recipient is a slice (or, in “back-door” process, Token issuer is Science Board, recipient is a slice)ParentGUID allows tokens to be chained together. For example, the token a SM gives a user will be chained to the token the RB gave to the SM (which the SM subdivided). This allows detection of SMs who manufacture more Tokens than allowed, etc.Meaning of value is not pre-ordained; it is up to the Resource Broker to determine value required for different kinds of tickets for different resources
62 Additional Details: Tickets <xsd:complexType name="Ticket"><xsd:sequence><xsd:element name="GUID" type="xsd:string"/><xsd:element name="Recipient" type="tns:SliceName"/><xsd:element name="RSpec" type="tns:RSpec"/><xsd:element name="ValidFor" type="xsd:duration"/><xsd:element name="Signature" type="xsd:base64Binary"/></xsd:sequence></xsd:complexType>RSpec specifies resources guaranteed to userTicket issuer is Resource Broker (or, in back-door process, Science Board)Ticket recipient is a sliceValidFor specifies how long the user can use the specified resources; allows authorization policies such as “you can use the node for any 30-minute period this week” (RSpec time spec indicates the “this week” part, while Ticket ValidFor specifies the “30-minute” part)
63 Implementing RA policies Current PlanetLab RA policy (per-node proportional share)Site Manager donates nodesSM receives >= N*M TicketsN=# of PL nodes, M=# users at siteSM gives each user N Tokens of value 1User presents one Token of value 1 and a resource request to RBRB returns a Ticket authorizing prop.-share use of requested nodeUser presents Ticket to CM, which returns Sliver on its nodeUser’s share = 1/P where P=number of users (slivers) on the nodeWeighted proportional shareAs above, but user presents Token of value T to RB (T may be > 1)User’s share = T/Q where Q=number of Tokens redeemed by other slivers that are using the node
64 Implementing RA policies (cont.) User wants guaranteed share of a node’s resourcesUser presents token of value T + resource request to RBRB returns a Ticket for guaranteed T% share of requested nodeUser presents Ticket to CM, which returns Sliver on its nodesliver is guaranteed a T% share of the node’s resourcesBut if RB has already committed more than 100-T% of the node, either1) RB refuses to grant Ticket, then(a) user tries again later, or(b) user tries again immediately, specifying a later starting time, or(c) out-of-band mechanism used to queue the request and issue callback to user when T% of the resource is available2) Or, RB grants the Ticket, setting ValidFor to requested duration; user presents Ticket at any time between ValidFrom and ValidTo(1a) and (1b) the user tries to guess when the resource will be available(1c) is current Emulab batch scheduling model(2) is a model where user says I will run whenever you want, but I need to know now when that will beBenefit of (1a) and (1b) is that user keeps control of when will have access; in (1c) and (2) the RB dictates to the user when the access will be granted
65 Implementing RA policies (cont.) Resource auctionsRB coordinates the bidding“Cost” of using a resource is dynamically controlled by changing “exchange rate” of Token value to Ticket shareLoan/transfer/share resources among users, brokers, or sitesTokens are transferrable, ParentGuid traces delegation chainSites and users can give tokens to other sites or users
66 Resource Alloc Deliverables (17E) Public API to set resource privileges on per-user/per-site basis.Public API to set use privileges on per-component basis (for site admins).Initial web-based interface to allow GENI Science Council to set per-user/per-site privileges using API in step 1.Initial web-based interface to allow administrators to set policy for locally available components.Refined versions of 1, 2, 3, 4 above based on user and community feedback.Design of capabilities to represent GENI resources.Design of tickets representing leases for access to individual GENI components.Initial implementation of Resource Brokers and client software to present requests for resources and to obtain the appropriate set of tickets.Site administrator API and web interface to assign privileges on a per-user basis.Integration with resource discovery service.Integration with experiment management software.
67 Open Issues Specifying resource aggregates (e.g. a cluster) Multiple, decentralized RBs rather than a single centralized RB run by GENIDescribing more complex sharing policiesBuild and deploy real implementationSite Manager, Resource Broker, Component Manager as SOAP web servicesbuild on top of existing GMC XSD specifications
68 Resource Allocation Conclusion Goal: flexible resource allocation framework for specifying a broad range of policiesProposal: centralized Resource Broker, per-site Site Managers, per-node Component ManagersPropertiesrewards sites for contributing resourceswith special back-door to give users and sites bonus resourcesencourages users to consume only the resources they needallows to express a variety of sharing policiesall capabilities (donations, tokens, tickets, slivers) time outallows resources to be garbage collectedallows dynamic valuation of users and resourcescurrently centralized, but architecture allows decentralizationsecure (all capabilities are signed)
69 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
70 Experimenter’s Support Toolkit Make it easy to set up and run experiments on GENIGoal: make GENI accessible to the broadest set of researchers, including those at places with little prior institutional experienceSupport different types of experiments/users:Beginners vs. expert programmersShort-term experiments vs. long-running servicesHomogeneous deployments vs. heterogeneous deployments
71 Typical Experiment Cycle ApplicationObtainResourcesConnect ToResourcesPrepareResourcesStart/MonitorProcessesClean UpResourcePoolPicture will look different for long-running services as process monitoring,resource preparation, etc. will proceed in a cycle
72 Desired Toolkit Attributes Support gradual refinement:Smooth implementation path from simulation to deploymentSame set of tools for both emulation and real-world testingMake toolkit available in different modesStand-alone shellLibrary interfaceAccessible from scripting languagesEnable incremental integration with other servicesFor instance, should be able to change from one content distribution tool to another by just changing a variableSophisticated fault handlingAllow experimenters to start with controlled settings and later introduce faults and performance variabilityLibrary support for common design patterns for fault-handling
73 Toolkit as an Abstraction Layer Long runningServicesEntry-level usersServicesrequiring fine-grained controlShellScripting LanguageAPI ComponentsExperimentInstantiationJob ControlI/OexceptionsDebugging,TransactionsSupportend-hosts,heterogeneityScalability,ResourcediscoveryLANClustersEmulabModelnetGENI
74 Basic Toolkit Components System-wide parallel executionStart processes on a collection of resourcesIntegrate support for suspend/resume/killIssue commands asynchronouslySupport various forms of global synchronization (barriers, etc.)Node configuration tools:Customizing node, installing packages, copying executables, etc.Integrate with monitoring sensorsDistributed systems sensors such as slicestat, CoMonInformation planes for network performance (such as iPlane)Integrate with other key servicesContent distribution systems, resource discovery systems, etc.
75 Advanced ComponentsKey stumbling block for long-running services is ensuring robustness in the presence of failuresNeed to provide support for incremental resource allocationLibrary support for common design patterns to handle faultsSupport for transactional operations and two-phase commits, support “execute exactly once” semantics, etc.Support for detecting abnormal program behavior, application-level callbacks, debugging, etc.Reliable delivery of control signals, reliable delivery of messages
76 Experiment Support (30E) Tools for performing system-wide job control: such as executing the same command on all nodes with desired levels of concurrency, etc.Tools for performing operations in asynchronous manner and synchronizing with previously executed commands.Tools for setting up necessary software packages and customizing the execution environment.Tools for coordinated input-output (copying files and logs).Exposing the toolkit functionality in a library API.Exposing the toolkit functionality using a graphical user interface (6-8E)Integration of tools into scripting languages.Provide simple ways for users to specify desired resources.Resilient delivery of control signals.Provide transactional support for executing system-wide commands.Provide support for detecting faults in experiments.Scalable control plane infrastructure -- dissemination of system-wide signals, coordinated I/O, and monitoring program execution should all be done in a scalable manner (3-5E)Interface with content distribution, resource discovery, slice embedding systems.Interface with the information plane for communication subsystem and various sensors monitoring the testbed.Tools for checking for global invariants regarding the state of a distributed experiment (4E)Logging to enable distributed debugging.Debugging support for single-stepping and breakpoints.Rationale:Most of the deliverables mentioned above can be performed with asingle person-year's amount of work. Exceptions are listed below.Task #6 is 6-8 person years, based on the Emulab experience asgraphical user interfaces are both critical and hard to get right thefirst time around. Task #12 is 3-5 person years as it takes asubstantial amount of software to provide a scalable infrastructure.Task #15 is probably 3-4 person years and could be potentiallydownscoped. Maintaining code and folding in user feedback is also asubstantial task and is potentially difficult to estimate.NOTE: this task is substantially harder than in planetlab because thetools must interact with many different kinds of devices (access points,routers, xen domains, etc).Can be downscoped: 100%Novelty: 20%Confidence Factor: 2x
77 Slice Embedding Deliverables (25E) 1) Resource specification language for describing user's needs.2) Generic matching engine.3) Algorithms for efficient matching.4) Matching engine for each subnet.5) Stitching module to compose results from different subnets.6) Integration with the resource discovery system to identify available resources.7) Integration with the resource allocation system to ensure allocation.RATIONAL:The initial version of the API could provide basic functionality: justsupport allocation of machines/compute time, rather than more complexresources such as wireless frequency, and can use simple algorithmsfor performing the matching even if it is expensive. The initialrollout should however have the module interface with resourceallocation and resource discovery. I am estimating 5 people-years forthis task. Subsequent work can focus on developing subnet specificmatching engines and resource-specifications, accounting for about 10people years. Finally, I am budgeting 5 people years for developingscalable algorithms that can work even when the user specifies complexconstraints on the resources desired. 5 more people-years for foldingin community feedback and maintaining the software. The work onbetter algorithms could be down-scoped. Interestingly enough, betteralgorithms and a rich resource specification framework accounts forthe novelty in this work. This is a service critical to ensure smoothoperation of GENI.Can be downscoped: 100%Novelty: 40%Confidence Factor: 3x
78 Topics Security architecture Edge cluster hardware/software definition Storage servicesResource allocationExperiment supportOperations supportCommunications substrateLegacy Internet applications support
79 Monitoring Goals History Metrics Triggers Reduce cost of running system through automationProvide mechanism for collecting data on operation of systemAllow users to oversee experimentsInfrastructure (i.e., node selection, slice embedding, history, etc.)HistoryClusters, Grid – GangliaPlanetLab – CoMon, Trumpet, SWORDMetricsNode-centric: CPU, disk, memory, top consumersProject-centric: summary statistics (preserves privacy)TriggersNode, project activity “out of bounds”Warning messages, actuatorsCombinations with experiment profiles
80 Operations Support Issues Two categories of support systemsOnline: monitor the function and performance of GENI components in real-timeUse the ITU FCAPS model to classify necessary support systemsOffline: problem tracking, maintenance requests, and inventoryBuild or buy decisionsFirst preference is to use open-source if available, appropriate, and competitiveDevelop re-distributable extensions as appropriateSecond preference is to purchase COTS softwareEvaluate cost per seat, educational discounts, and impact of restricted access to system dataLast choice is to build systems from scratch if no suitable alternatives exist
81 FCAPS (Fault, Configuration, Accounting, Performance, Security) Fault managementDetect and track component faults in running systemInitiate and track the repair processExample systems: Nagios, HP OpenView, Micromuse NetcoolConfiguration managementAutomate and verify introduction of new GENI nodesProvision and configure new network linksTrack GENI hardware inventory across sitesExamples: PlanetLab boot CD, Telcordia Granite Inventory, Amdocs Cramer Inventory, MetaSolvAccountingManage user and administrator access to GENI resourcesMap accounts to real people and institutionsExamples: PlanetLab Central, Grid Account Management Architecture (GAMA)
82 FCAPS (Fault, Configuration, Accounting, Performance, Security) Performance managementFine-grained tracking of resource usageQueryable by administrators and adaptive experimentsDetecting and mitigating transient system overloads and/or slices operating outside their resource profilesExamples: CoMon, HP OpenView, Micromuse NetcoolSecurity managementLog all security-related decisions in an auditable trailViewable by cognizant researcher and operations staffMonitor compliance with Acceptable Use PolicyTry to detect certain classes of attacks before they can cause significant damageExamples: Intrusion detectors, compliance systems, etc.
83 Problem TrackingAll researcher/external trouble reports, plus any traffic incident reportingExamples: this filesystem seems corrupt, this API does not seem to match the behavior I expect, or “why did I receive this traffic?”Receive alerts/alarms from platform monitoring system (e.g., Nagios, OpenView, etc.)Track all reported alarms, delegate to responsible parties, escalate as neededClassify severity, prioritize development/repair effortExamples: Request Tracker (RT), Bugzilla, IBM/Rational ClearQuest
84 Operations Support (31E) 1) GENI Fault Management System software (4E)2) GENI Configuration Management System software (4E)3) GENI Accounting Management System software (2E)4) GENI Performance Management System software (3E)5) GENI Security Management System software (2E)6) GENI Problem Tracking System software (2E)7) GENI Community Forum software (2E)8) Lifecycle management of all software components (12E)Novelty:25% of the work will require innovation. While many of the OperationsPortal functions are reasonably well-understood for conventionalnetworks, the programmability, virtualization, and scale inherent inGENI will require new ideas in Operations. Fault and performancemanagement for GENI will require innovation. In particular, newresearch will be needed to determine the best way of performingroot-cause analysis of faults in a system like GENI.Confidence Factor: 3XRational:We assume that the GENI Fault, Accounting, Performance, and SecurityManagement systems can be derived from existing open-source tools thatprovide similar functions in other environments. The GENI ConfigurationManagement System may need to be built from scratch or derived fromcommercial software. The Problem Tracking System and GENI CommunityForum can likely be derived with little modification from existingopen-source software that performs similar functions.**Deliverable 8: We assume 4E per year recurring costs for three years tomaintain and upgrade software components.
85 Communication Substrate Bulk data transfer.Small message dissemination (e.g., application level multicast) for control messagesLog/sensor data collectionInformation plane to provide topology information about both GENI and the legacy InternetSecure control plane service running on GENIso that device control messages traverse over the facility itself, and therefore cannot be disrupted by legacy Internet traffic.essential if the facility is to be highly available.
86 Communication Deliverables (11E) 1. Bulk data transfer (e.g., CoBlitz or Bullet), to load experiment code onto a distributed set of machines (3E)2. Small message dissemination (e.g., application level multicast) for control messages to a distributed set of machines (1E)3. Log/sensor data collection, from a distributed set of machines to a central repository/analysis engine (3E)4. Information plane to provide topology information about GENI and the legacy Internet (1E)5. Software maintenance and upgrades (3E)Downscoping: 100%If these items are left off, the downside will be lower efficiency inutilizing resources, but the system will continue to function.Novelty: 20%Depending on the final requirements, the log collection service willrequire some thinking to make it work well.Confidence factor: 2xRationale:For budget reasons, we eliminated a deliverable, the secure controlplane service. This service uses GENI resources, so that device controlmessages traverse over the facility itself, and therefore cannot bedisrupted by legacy Internet DoS attacks. Since this service can bebuilt as a service running on GENI, the lack of the service does notprevent an experimenter who needs it from building it themselves.
87 Legacy Services Virtualized HTTP (2E) Virtualized DNS (2E) Allow experiments to share port 80Virtualized DNS (2E)Allow experiments to share port 53Client opt-in (12E)Assumes Symbian, Vista, XP, MacOS, Linux, WinCEDistributed dynamic NAT (2E)Connections return to sourceVirtualized BGP (backbone group)
88 Prioritization High priority (“can’t live without it”): 60E Data transfer to set up experimentsLocal storageResource allocation/mgtOperations supportMedium priority (“should have”): 50ELegacy servicesQuick and dirty experiment supportEfficient disk storageLog data collectionNice to have: 40EInformation planeSimplified file system interfaces for executionMore functional workbenchSlice embeddingNotes:Security and edge cluster prioritized elsewhere (part of GMC)Prioritization also needed within each functional area (e.g., ops support)Procedurally, we estimated the budget #s in the following way. We extracted from the various GENI documents that the working group had written, the list of concrete deliverables for each portion of the services effort. Mic Bowman and I came to a consensus estimate of the number of person-years of *developer* effort that would be required to develop that piece of functionality. That is the #E value next to each deliverable. You should realize that number is incredibly squishy, as we often had little more than a one paragraph description of the deliverable, and there's an order of magnitude difference in cost that is possible for a given component based on how thoroughly it is engineered, how complex the API is, etc. We were somewhat, but not overly I think, conservative in estimating the amount of developer time. One reason for being conservative is that at the time we did the estimates, there was no process in place for tracking requirements/interfaces, and therefore we had little assurance we could bound the complexity of each piece (requirements that might be tacked on that would drive up the amount of effort). For example, the project had made no effort to minimize the number of unique pieces of hardware that the software would need to be ported to, which would obviously drive up the cost (at the very least, of the testing effort). Similarly, if we were shooting for a bare bones workable system, the costs would go down significantly. However, we had no way of knowing which world we were operating in, and I think that's still true.We then took the developer effort and did a mechanical expansion to determine the final budget #. To each developer-year, we added 0.4 tester-years, 0.2 manager-years, and 0.2 architect (technical lead) years. This is based on Mic and my estimate of standard practice in industry; academic efforts like PlanetLab use fewer testers and architects/technical leads (so Larry's numbers for these are different), and whether that is a good or a dumb idea is open to debate. There is another factor of 2 added in the spreadsheet due to overhead. We then somewhat arbitrarily assigned a factor of 2 uncertainty factor (in reality, we could have defended 10x uncertainty given the level of specificity we have). In some of the text portions, there is some discussion as to which deliverables have the greatest technical risk, but that's again only a ballpark estimate, and in most cases, we kept the 2x factor regardless, as the minimal defensable uncertainty (that is, there exists a process for which that would be the uncertainty). I believe it would be possible to design a useful system for the budgeted time/effort; whether it is less or more is a question of priorities and having a rational process for managing the list of required features. In order to get a much closer number, we would need to have a lot more detail about each component.
89 ConclusionWe understand most of what is needed to build security, user support into GENILots of work still to do to refine designComments welcomeDesign not intended as a fixed point