Grid’5000: A Nationwide Experimental Grid
Why an experimental Grid?
- Grids raise many research issues: security, performance, fault tolerance, load balancing, fairness, coordination, message passing, data storage, programming, communication protocols and architecture, deployment, etc.
- Theoretical models and simulators cannot capture real-life conditions.
- Production platforms have great difficulty reproducing experimental conditions.
- How, then, to test and compare fault tolerance protocols, security mechanisms, deployment tools, etc.?
- Grids renew the classic problems of distributed systems.
Tools for Distributed System Studies
[Figure: research tools placed on log(cost) vs. log(realism) axes, from math through simulation and emulation to live systems]
To investigate distributed-system issues, we need: 1) tools (models, simulators, emulators, experimental platforms) and 2) strong interaction between these research tools.
Tools for large-scale distributed systems:
- Math: models of systems, applications, platforms, and conditions
- Simulation: virtual platforms, synthetic conditions, algorithm and application kernels
- Emulation ("in-lab" platforms): real systems and applications, synthetic conditions, key system mechanisms
- Live systems: real systems, real applications, real platforms, real conditions
Positioning existing tools
[Figure: the same log(cost) vs. log(realism) spectrum, populated with tools]
- Math: models, protocol proofs
- Simulation: SimGrid, MicroGrid, Bricks, NS, etc.
- Emulation: Grid eXplorer, WANinLab, Emulab
- Live systems: Grid’5000, TeraGrid, PlanetLab, the NAREGI testbed
We need a Grid experimental platform: according to current knowledge, there is no large-scale testbed dedicated to Grid experiments.
- Grid’5000 as a live system
- Grid eXplorer as a large-scale emulator
What do we need for Grid experiments?
1) Remotely controllable Grid nodes installed in geographically distributed laboratories
2) A "controllable" and "monitorable" network between the Grid nodes
3) A middleware infrastructure connecting the nodes (security)
4) A playground to prepare experiments
5) A toolkit to deploy, manage, and run experiments, and to collect results
The Grid’5000 Project
1) Build a nationwide experimental platform for Grid research (like a particle accelerator for computer scientists):
- 10/11 geographically distributed sites, each hosting a cluster (from 256 CPUs to 1K CPUs)
- All sites connected by RENATER (the French academic network), which hosts probes to trace network load conditions
- Design and develop a system/middleware environment to safely run and repeat experiments
2) Use the platform for Grid experiments:
- Address critical issues of Grid systems/middleware: programming, scalability, fault tolerance, scheduling
- Address critical issues of Grid networking: high-performance transport protocols, QoS
- Port and test applications
- Investigate original mechanisms: P2P resource discovery, desktop Grids
Grid’5000 Schedule
Hardware:
- Call for proposals: Sept. 03
- Selection of 7 sites: Nov. 03
- ACI GRID funding: Jan. 04
- Call for Expressions of Interest: March 04
- Vendor selection: June/July 04
- Installation, first tests: Sept. 04
- Final review: Oct. 04
- First demo (SC04): Nov. 04
System/middleware:
- Forum, security prototypes, control prototypes
Programming:
- Forum, Grid’5000 builder community, Grid’5000 experiments
Grid’5000 Funding (ACI + local district/prefecture)
- Per-site contributions: ~0.4, ~0.35, ~0.5, ~0.35, ~0.3 (?), ~0.5 M€
- ~3 M€ for hardware only
Grid’5000 prototype (control)
[Diagram: users log in over SSH (login + password) to a control site behind a firewall/NAT; a control master issues rsync transfers (kernels, distributions) and orders (boot, reset) to control slaves at Sites 1–3; each site has its lab network, firewall/router, front end, test cluster, and a boot server with DHCP]
- System kernels and distributions are downloaded from a boot server.
- They are uploaded by the users as system images.
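The control sequence on this slide (push a user-supplied system image to a site's boot server, then order the test nodes to reboot so they network-boot the new image) can be sketched as follows. The host names, paths, and the `reset-nodes` command are hypothetical illustrations, not the real Grid'5000 tooling; the sketch only builds the command lines rather than executing them.

```python
# Sketch of the Grid'5000-style control flow: rsync a system image to a
# site's boot server, then ask that site's control slave to reset nodes.
# All names below are invented for illustration.

def rsync_image_cmd(image, boot_server, dest="/tftpboot/images/"):
    """Build the rsync command that uploads a system image to a boot server."""
    return ["rsync", "-az", image, f"{boot_server}:{dest}"]

def reset_order_cmd(control_slave, nodes):
    """Build the SSH command that asks a site's control slave to reset nodes."""
    return ["ssh", control_slave, "reset-nodes", *nodes]

# Example: deploy to a hypothetical site, then reboot two of its nodes.
print(rsync_image_cmd("debian-grid.img", "boot.site1.example"))
print(reset_order_cmd("control.site1.example", ["node-1", "node-2"]))
```

In a real deployment these argument lists would be handed to something like `subprocess.run`; building them as lists avoids shell-quoting issues.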
Grid’5000 prototype (test)
[Diagram: the same three-site layout (lab networks, firewalls/routers, front ends, test clusters, control master and slaves), plus a gateway with a VPN assigning private 192.x addresses to all nodes]
- One machine (the gateway) makes the whole platform visible as a single virtual Grid.
Grid’5000 control
- Experiment automation
- VGrid: "mapping a virtual Grid on a real testbed"
Fault tolerance
- Fault-tolerant Grid-RPC (RPC-V)
- Hierarchical fault-tolerant MPI (MPICH-V)
MPI environment
- Time-sharing Grid resources
- Migration over clusters with heterogeneous high-speed networks
Global computing / P2P middleware (XtremWeb)
- Executing Web services on desktop Grid workers
- Distributing the coordination in desktop Grids
- Harnessing clusters as parallel workers
Large-scale experimentation of the DIET platform (GRAAL Team)
DIET (Distributed Interactive Engineering Toolbox): a Client/Agent/Server model following the GridRPC standard, with distributed scheduling agents.
- Hierarchical and distributed scheduling: validation of distributed scheduling (hierarchical, distributed, duplicated) and associated heuristics
- Automatic deployment: mapping agents depending on the characteristics of the target architecture (network hierarchy, dynamicity, performance variations)
- Mixed parallelism (task and data parallelism): scheduling requests one after another (online scheduling) or using sets of requests (static knowledge of the dependences and the application graph)
- Mixing data management and task scheduling: replica management, data mapping and redistribution between clusters, task graphs
Grid networking (RESO Team)
End-host communication layer
- Intelligent use of NICs for local and wide-area communications
- Direct file access over Myrinet: ORFA/NFS and ORFA/LUSTRE
High-performance long-distance protocols
- Alternative transport for very high-speed networks (backpressure)
- Differentiated transport with delay control on WANs
- Reliable active and non-active multicast
High-speed network emulation
- Automatic deployment of emulated high-speed domains
- Experiment design for studies of Grid flow interactions
Grid networking layer
- Network resources and QoS on demand
- Grid overlays and programmable routers
- Measurement services for network-aware middleware
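The "backpressure" transport idea mentioned above can be illustrated with a toy flow-control loop: the sender paces itself from the receiver's buffer occupancy instead of reacting to packet loss. This is a self-contained sketch of the general principle, not the RESO protocol; the high-water mark and drain rate are arbitrary illustrative values.

```python
# Toy backpressure flow control: the sender emits only while the
# receiver's buffer is below a high-water mark, so the receiver's
# drain rate, not loss, throttles the sender.
HIGH_WATER = 4  # illustrative buffer threshold

def transfer(packets, drain_per_tick=2):
    buffer, delivered = [], []
    pending = list(packets)
    while pending or buffer:
        # Sender side: emit until the backpressure signal (buffer full).
        while pending and len(buffer) < HIGH_WATER:
            buffer.append(pending.pop(0))
        # Receiver side: drain a fixed amount per tick.
        for _ in range(min(drain_per_tick, len(buffer))):
            delivered.append(buffer.pop(0))
    return delivered

print(transfer(range(10)))  # all 10 packets delivered, in order
```

On a very high-speed network the appeal of this scheme is that it never has to provoke a loss to discover the available rate, which loss-based TCP variants must do.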
Large-scale scientific computing applications (CFD, astrophysics, ...)
Multi-parametric applications
- ACI GRID-TLSE project: an expertise site for sparse linear algebra
- Climate modeling and global change
- DataGène project: functional genomics
Management
- AROMA tool: management of resources over a Grid of clusters with different classes of service
- Mobile agents for open Grid management
- Management of Grids and hosted services (security, QoS, monitoring & control, dynamic configuration, ...)
- Optimization for wide-area distributed query processing
- Virtualization of data storage on Grids
Java programming on the Grid (Oasis project-team)
- ProActive: a Java library for parallel, distributed, concurrent computing with security and mobility
- Assessment of scalability, deployment, security, and fault tolerance issues
- Hierarchical component architecture
Large-scale experimentation of distributed applications
- JECS: a Java Environment for Computational Steering
Distributed computing and interactive visualization of 3D numerical simulations (Caiman and Oasis project-teams)
- Collaborative environment
- Computational electromagnetism application (JEM3D)
MECAGRID (ACI GRID project, Smash project-team)
- Massively parallel computations in multi-material fluid mechanics
- Study of numerical algorithms for heterogeneous computing platforms
Grid computing for medical applications (Epidaure project-team)
- Interoperable medical image registration Grid service
Optimal design of complex systems (Coprin project-team)
- Evaluation of parallel optimization algorithms based on interval analysis techniques
- Study of load-balancing strategies on heterogeneous resources
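The active-object model behind ProActive-style libraries (a method call returns immediately with a future, and the caller blocks only when it touches the result, so-called wait-by-necessity) can be sketched with the Python standard library. This is an illustration of the concept, not the ProActive API; the class names are invented.

```python
# Sketch of the active-object pattern: each active object has its own
# service thread, calls return futures immediately, and the caller
# blocks only on future.result() ("wait-by-necessity").
from concurrent.futures import ThreadPoolExecutor

class ActiveObject:
    def __init__(self, target):
        self._target = target
        # One service thread: requests are executed one at a time,
        # which keeps the target free of internal data races.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def call(self, method, *args):
        """Asynchronous method invocation; returns a future immediately."""
        return self._executor.submit(getattr(self._target, method), *args)

class Simulation:
    def step(self, n):
        return n * n

ao = ActiveObject(Simulation())
future = ao.call("step", 7)   # returns at once, computation runs elsewhere
print(future.result())        # wait-by-necessity → 49
```

ProActive adds distribution, mobility, and security on top of this basic pattern; the single-threaded service policy shown here is the part that makes active objects easy to reason about.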
Grid’5000 control
- Deployment of computing environments (Ka-tools)
Grid computing
- Multi-cluster and lightweight Grid resource management (OAR/CIGRI)
- Grid file system (NFSG)
- Scheduling: data transfers, global communications, hierarchical work stealing, ...
- Monitoring, benchmarking, performance characterisation and analysis
Code coupling
- Application coupling with Athapascan
- Communication / method-invocation rescheduling in an ORB (HOMA)
Fault tolerance / security
- Fault tolerance in a data-flow approach (Athapascan)
- Probabilistic certification in peer-to-peer systems
Miscellaneous
- Collaboration tools in virtual 3D environments
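The "hierarchical work stealing" scheduling item above rests on a simple mechanism: each worker owns a task queue and serves it from the front, while an idle worker steals from the back of a loaded peer's queue. A minimal single-threaded sketch of that mechanism (not the Athapascan/OAR implementation; victim selection here is simply "the longest queue"):

```python
# Minimal work-stealing sketch: workers pop local work from the front
# of their own deque; an idle worker steals from the *back* of the
# longest peer queue, where the oldest (often largest) tasks sit.
from collections import deque

def work_steal(queues):
    trace = []  # (worker, task) in execution order
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                trace.append((i, q.popleft()))       # run local work first
            else:
                victim = max(range(len(queues)), key=lambda j: len(queues[j]))
                if queues[victim]:
                    q.append(queues[victim].pop())   # steal from the back
    return trace

queues = [deque(["t1", "t2", "t3", "t4"]), deque()]
trace = work_steal(queues)
print(trace)  # all four tasks run, worker 1 executes stolen work
```

Stealing from the back is the classic design choice: it avoids contending with the victim's own front-of-queue pops and, with divide-and-conquer workloads, tends to transfer the biggest remaining sub-tasks, which matters even more in a hierarchy of clusters where steals are expensive.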
Middleware
- PadicoTM/Paco++: combining parallel and distributed computing (ACI GRID RMI)
- Scalability of consistency protocols in DSMs (ACI MD GDS)
- Experimenting with management services for textual documents in P2P systems
- Data redistribution in Grids (ARC INRIA ReDist)
- Coupling computational Grids with a Reality Center (INRIA SIAMES, RNTL SALOME2)
- Task distribution and load balancing in heterogeneous Grids (ACI GenoGrid)
Code coupling
- Fluid-transfer simulation and geological code with PadicoTM/Paco++ (ACI GRID HydroGrid)
Networking
- Network bandwidth optimization in Grids (VTHD++, Paco++)
Runtime systems for clusters of clusters and Grids (RUNTIME INRIA team)
- High-performance communication across heterogeneous networks
- Communication libraries: Madeleine, MPICH/Madeleine
- Efficient use of high-speed networks (Myrinet, InfiniBand, SCI)
- Fast forwarding and multiplexing of data on gateway nodes
Distributed systems and objects (SOD team)
- Tools to support the development, administration, and use of heterogeneous resources over the Grid
Large-scale scientific computing applications (SCALAPPLIX INRIA project)
- Fluid mechanics, molecular dynamics, host-parasite systems in population dynamics, etc.
Steering of numerical simulations (ACI GRID-EPSN project)
- Parallel on-line visualization / monitoring
- Data redistribution
- Computational steering by direct image manipulation