1  The batch virtualization project at CERN
Tony Cass, Sebastien Goasguen, Ewan Roche, Ulrich Schwickerath
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
HEPIX 2009, Berkeley

See also:
- Virtualization, Ewan Roche, HEPIX spring 2009
- Virtualization vision, Tony Cass, GDB 9/9/2009 and HEPIX
- Virtualisation of Batch Services at CERN, Sebastien Goasguen
- Batch virtualization at CERN, Ulrich Schwickerath, EGEE09 conference, Barcelona

2  Outline

- Service consolidation and batch virtualization
- Motivation for batch virtualization
- Requirements and basic concepts
- Initial phase: prototype description
- Production deployment
- Vision beyond phase 1
- Summary

3  Service consolidation and batch virtualization

Virtualization for service consolidation (Juraj Sucik):
- Up to a few hundred machines (today)
- Typically little CPU usage on these boxes
- Support for live migration required
- Based on reliable hardware
- Critical services

Virtualization for batch and similar applications (this talk):
- Large scale, O(several 1000) machines
- High CPU usage, number crunching
- Limited lifetime is OK!
- Cheap batch hardware is OK!
- Individual machines are not critical
- Option for a "cloud"-like infrastructure

4  Batch virtualization: why?

User perspective:
- Customization of images for specific use cases
- Allows skipping of initial consistency checks (context: pilot jobs)
- Possibility to run old code for a longer time

Operations perspective:
- Strict encapsulation of user jobs in their sandbox
- Decoupling of applications from the underlying hardware
  - Recent hardware only works with recent OS versions
  - Complex applications are difficult to port to new OS versions
- Easy transition from one OS to another
- Dynamic change of worker node types depending on requirements
  - Better use of resources
- Easier handling of intrusive updates and operations
  - Roll-out of important security updates in a rolling way

5  Requirements

- Backward-compatible approach
  - No change for traditional users
  - Transparent migration
- Start small and grow
- Seamless integration into the existing system
  - Less workload for administrators after deployment
  - Possibility to mix virtual and real resources
- Make maximum use of what exists already!

Operational requirements:
- No changes to the infrastructure itself
- Full integration into the existing systems and databases

Vision: be fully backward compatible while opening the door to explore new computing models.

6  Basic concepts and proof of concept

First step: adding virtual batch worker nodes to lxbatch at CERN.

Hypervisors:
- Unique set of machines, organized in a new cluster, "lxcloud"
- Fully Quattor managed and Lemon monitored
- Very restricted access (no user access!)
- Minimal software setup
- Focus on XEN for stability; KVM support in preparation

Golden nodes:
- Fully Quattor managed VMs with a worker node setup
- Running LSF software
- Don't run jobs, but get updated
- Serve as templates for the creation of images
- Frequently snapshotted, e.g. after updates

Worker nodes:
- Virtual machines with public IPs
- Derived from golden nodes
- Not Quattor managed; stable during their lifetime
- Dynamically join the existing LSF cluster and accept jobs
- Limited lifetime (draining after 24h, shutdown when idle)
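The limited-lifetime policy for virtual worker nodes (accept jobs while young, drain after 24 hours, shut down once drained and idle) can be sketched as a small decision function. This is an illustrative sketch only; the function name, the way job counts are obtained, and the exact threshold handling are assumptions, not the CERN implementation.

```python
from datetime import datetime, timedelta

# Slide 6: virtual worker nodes start draining after 24h of lifetime.
DRAIN_AFTER = timedelta(hours=24)

def next_action(started_at, running_jobs, now):
    """Decide what to do with a virtual worker node.

    Returns "accept" while the VM is within its lifetime,
    "drain" once the lifetime is exceeded but jobs still run,
    and "shutdown" when it is drained and idle.
    """
    if now - started_at < DRAIN_AFTER:
        return "accept"    # keep accepting jobs from the LSF cluster
    if running_jobs > 0:
        return "drain"     # stop taking new jobs, let running ones finish
    return "shutdown"      # drained and idle: terminate the VM
```

A management loop would call this periodically per VM and act on the returned state, e.g. closing the node in the batch system on "drain".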

7  Proof-of-concept setup

Phase 1 proof of concept is done, with some additions:
- Successful launch of CERNVM images as well
- Testing of the OpenNebula cloud interface ongoing

(Diagram: golden nodes feed an image source / VM kiosk; virtual worker nodes run on XEN hypervisors, managed by VMO/OpenNebula; they join the LSF master, which is coupled to a CREAM CE receiving jobs from the GRID world.)

8  Initial phase: supported images

Step 1: fully backward-compatible approach
- SLC4/64bit with 32bit compatibility
- SLC5/64bit with 32bit compatibility
- 2GB RAM plus some work space
- 1 CPU only
- Supported images are 1:1 snapshots of current worker nodes

Step 2: also support specialized images, on user demand
- More CPUs, more memory: e.g. 2 CPUs/4GB, 4 CPUs/8GB, etc.

9  Image selection and VM placement

Image selection:
- Driven by user demand
  - Requirements of pending jobs in the queues
  - Historical data
- Must respect user priorities
- Requires coupling between the VM management software and the batch system
- Can be tricky to implement

VM placement:
- A simple load-driven resource allocation manager is fine
- We rely on existing solutions
  - Both commercial and free solutions are available
  - Looked at Platform VMO (considering ISF) and OpenNebula
  - Both have their pros and cons

10  Image selection

Need a meta-scheduler which connects the batch system to the image deployment mechanism.

Requirements:
- Optimization goal must be efficient use of resources
- Respect fair share and queue priorities
- Lightweight: minimize the number of queries to the batch system

Status:
- Existing prototype in VMO/ISF, developed by Platform with/for CERN
  - Based on batch system queries; not really integrated into LSF
- Simple prototype for OpenNebula done by Sebastien Goasguen

Ideas for a better design (the current system may not scale well):
- Could be slow-control and run entirely offline
- Respect boundaries: offer a minimum and maximum number of specific VMs
- Use historical data instead of online requests
- Get it right on average within boundaries; no short-term big changes
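The "better design" ideas above (per-image minimum/maximum boundaries, demand-driven targets) could be sketched as follows. This is a hypothetical sketch: the proportional-share rule, the dict-based interfaces, and the image names are illustrative assumptions, not the Platform/ISF or OpenNebula prototypes.

```python
def target_vm_counts(pending_jobs, bounds, capacity):
    """Compute how many VMs of each image type to run.

    pending_jobs: demand per image type, e.g. {"slc4": 100, "slc5": 300}
    bounds:       (min, max) VMs allowed per image type
    capacity:     total number of VM slots available on the hypervisors
    """
    total = sum(pending_jobs.values()) or 1  # avoid division by zero
    targets = {}
    for image, (lo, hi) in bounds.items():
        # Share capacity proportionally to pending demand...
        share = round(capacity * pending_jobs.get(image, 0) / total)
        # ...but clamp to the configured boundaries (slide 10).
        targets[image] = max(lo, min(hi, share))
    return targets
```

Run offline on historical averages rather than per-query, this matches the "right on average within boundaries" goal without hammering the batch system.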

11  VM placement: VMO versus OpenNebula

OpenNebula:
- Intuitive user interface
- No inherent integration with any batch system
- Used in the prototype at CERN as well
- Free software with public sources, so easy to debug
- Easier to integrate into the existing infrastructure at CERN
- Support for EC2 (work ongoing as we speak)

VMO (Virtual Machine Orchestrator):
- Uses EGO as resource manager (same as LSF version 7.x)
- Commercial product with professional support
- Existing prototype for LSF-VM coupling
- Some issues due to CERN networking boundary conditions
- Demonstrated to work at the multi-threading workshop at CERN

12  GRID integration in the proof of concept

Proof-of-concept setup:
- Up to 18 nodes acting as hypervisors
- LSF test setup with its own master
- 1 management host (for OpenNebula)
- 1 CREAM CE
  - Only very minor modifications to the CREAM CE needed
  - A clone of the production CEs at CERN (but not further advertised)
- Successfully tested by ALICE using direct job submission to the CE

Remark: selection of the OS version by the user (via JDL) works as well. To fully exploit this, more modifications to the information system on the CE will be needed.

13  Proof-of-concept demonstration

(Plot: VMO-LSF coupling tests; number of VMs/jobs versus time, showing jobs, starting VMs, running VMs and suspended VMs.)

14  Going to production

Current status:

lxcloud:
- Selected and installed 10 new powerful CPU servers, organized in the new cluster lxcloud
  - 24GB of RAM, 2TB disk space, dual Nehalem CPUs
- Partitioning and LVM setup being finalized
- Starting with XEN
- Network setup (public IP service for the VMs) is being worked on
- Hardening and packaging of the initial prototype scripts started
- Planning of operational procedures started

VM kiosk:
- Planning phase; will start with a simple setup
- Different options for image distribution are under discussion
- Goal: golden nodes should/could be managed within the service consolidation project

Important milestones:
- Prototype of a dynamically adapting system: December 2009
- Production-quality release: March 2010

15  Challenges for production

Challenges to be addressed include:

Hardware selection:
- A 16GB RAM hypervisor cannot run eight 2GB VMs (!)
- KVM may help; stability unclear
- Current images are huge
- Attach temporary space per VM from the hypervisor

The virtualization kiosk (= image repository):
- Scalability is a concern
- Idea: move images asynchronously to the hypervisors if they have changed
  - Cache them on the hypervisor
  - The deployment mechanism needs to know which hypervisor has which images

IP addresses:
- We definitely need outbound connectivity
- CERN can use public IP addresses; the organization is a challenge
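The asynchronous kiosk-to-hypervisor distribution idea boils down to comparing what the kiosk publishes with what each hypervisor has cached. A minimal sketch, where the checksum-catalogue format is an assumption for illustration:

```python
def stale_images(kiosk_catalogue, hv_cache):
    """Return the image names a hypervisor must (re)fetch.

    kiosk_catalogue: {image_name: checksum} as published by the VM kiosk
    hv_cache:        {image_name: checksum} for images already cached
                     on the hypervisor
    """
    return sorted(
        name
        for name, digest in kiosk_catalogue.items()
        # Missing or outdated checksum means the image changed upstream.
        if hv_cache.get(name) != digest
    )
```

The deployment mechanism would keep one such cache map per hypervisor, which also answers "which hypervisor has which images" when placing a VM.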

16  Beyond phase 1: VM visions

Reusing images outside CERN (phase 2):
- Phase 1 images are highly CERN-customized
- Requires a trusted portable base image...
- ...plus a more complex contextualization phase at the site
- A possibility for the site to add site-specific software without compromising the image integrity

Experiment-specific images (phase 3):
- As before, but this time the software environment gets customized for a specific experiment
- Removes the need for pilot jobs to check the environment, which is already correct by construction
- Requires good control of the experiment software stack to avoid compromising the image integrity

User-defined images / CERN-VM:
- Use of resources in a cloud-like operation mode (phase 4)

Images join different batch farms at startup (phase 5):
- Controlled by the experiment
- Spread across sites
- Possibly replaces the current pilot job framework

17  Contextualization phase

Happens at instantiation time of the VM, usually to:
- Set up networking
- Set up batch resources
- Set up monitoring
- ...

Can we extend this phase to allow reuse of images across sites?
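An extensible contextualization phase, as proposed here, would run the base steps and then hand control to site-supplied hooks. The sketch below is purely illustrative: the step names follow the slide, but the hook interface and function names are assumptions, not an existing CERN or OpenNebula API.

```python
def contextualize(vm, site_hooks=()):
    """Run base contextualization steps, then site-specific add-ons.

    vm:         plain dict describing the instance (hypothetical format)
    site_hooks: callables a site provides to extend the phase without
                modifying the trusted base image
    """
    log = []
    # Base steps every site runs (slide 17).
    for step in ("setup_networking", "setup_batch", "setup_monitoring"):
        log.append(step)
    # Site add-ons, e.g. mounting home directories or local monitoring.
    for hook in site_hooks:
        log.append(hook(vm))
    return log
```

Keeping site customization in hooks rather than in the image is what would let one signed image serve many sites.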

18  Cross-site trusted images

Idea/goal: a single source for images, usable across sites and trusted:
- Signed
- Managed, i.e. security issues fixed
- Blessed and tested by the experiments (!)

Challenges:
- Centralized VM kiosk providing signed images
- Templates need to be portable
- Authentication/user accounts for the jobs running in the images

A complex contextualization phase allows for site add-ons:
- Add software, e.g. monitoring software
- Different batch scheduler
- Import home directories / global file systems
- Customize root access (preferably: none at all)
- Configure the node

19  Cross-site trusted images (continued)

Some initial thoughts about possible options:

Create additional "generic" golden nodes:
- Supporting all possible batch systems
- Could be VO-specific images
- Allow for site-dependent contextualization
- May result in huge images
- Difficult to remove all site dependencies while maintaining the image

Start from current golden nodes:
- Remove all site dependencies first
- Perform site-dependent contextualization, including site-dependent software
- Site-dependent add-ons may break the software setup

Start from CERN-VM images:
- Complex site-dependent contextualization may break the image
- On-site backward compatibility may have to be dropped

20  Summary, conclusions and plans

- CERN has developed a concept for a migration of batch services to virtual machines
  - Start small, then grow
  - Be backward compatible
  - Open the doors to new computing models
- A proof of concept has been demonstrated
- A prototype installation is in progress
- Cross-site image usage will be a challenge
