PES Lessons learned from large scale LSF scalability tests

PES Lessons learned from large scale LSF scalability tests
Sebastien Goasguen, Belmiro Rodrigues Moreira, Ewan Roche, Ulrich Schwickerath, Romain Wartel
HEPiX 2010, Cornell University

Outline
- Why we did it
- How we did it
- Lessons learned:
  - While doing it
  - From the actual measurements

Why scalability tests? Facts about the CERN batch farm lxbatch
- Size:
  - ~3,500 physical hosts
  - ~40 hardware types
  - ~25,000 CPU cores
  - ~80 processing queues
  - ~180,000 jobs/day
  - ~22,000 jobs in the system at any given time
  - ~20,000 registered users
- User jobs: a chaotic and unpredictable mixture of CPU- or I/O-intensive jobs; mostly independent sequential jobs, few multi-threaded ones
- Currently supported operating systems: SLC5 (64-bit) and SLC4 (32/64-bit)

Why scalability tests?
- Growth rate over the past years [figure]
- >50% dedicated
- One LSF instance

Why scalability tests? LSF limits
- Vendor information (Platform, private communication) (*):
  - Communication layer: ~20,000 hosts
  - Batch system layer: ~5,000-6,000 hosts, up to ~500,000 jobs
- Previous LSF scalability tests at CERN:
  - Focus on batch queries and jobs
  - Focus on fail-over system layout
  - Limitations on worker nodes not testable
(*) mostly based on simulations

Why scalability tests?
- We already start to see slower response times at our current scale when there are many concurrent queries
- Virtualizing the batch worker nodes has a big impact on the number of resources the batch system needs to manage: an 8:1 ratio now, even more in the future?
- The use of virtualization enabled us to test LSF at a larger scale than ever before: up to 16,000 worker nodes

How we did it
- Temporary access to 480 new physical worker nodes:
  - 2 x Xeon L5520 @ 2.27 GHz CPUs (2 x 4 cores)
  - 24 GB RAM
  - ~1 TB local disk space
  - HS06 rating ~97
  - SLC5, Xen (1 rack with KVM for testing)
  - Up to 44 VMs per hypervisor (on private IPs)
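As a rough sanity check (not from the slides), the quoted figures imply a comfortable margin over the scale reached later in the tests; a minimal sketch of the arithmetic, assuming the 480 hosts and the 44-VMs-per-hypervisor ceiling stated above:

```python
# Back-of-envelope check (not from the slides) of what the stated hardware implies.
hypervisors = 480
max_vms_per_hypervisor = 44          # upper limit quoted above

ceiling = hypervisors * max_vms_per_hypervisor        # theoretical maximum: 21,120 VMs
target_nodes = 16000                                  # scale aimed for in the tests
per_hypervisor_needed = target_nodes / hypervisors    # ~33 VMs per hypervisor

print(f"ceiling: {ceiling} VMs, ~{per_hypervisor_needed:.0f} VMs/hypervisor "
      f"for {target_nodes} virtual worker nodes")
```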

How we did it
- By making use of our internal cloud infrastructure!
- Opportunity to test and probe in addition:
  - The general infrastructure
  - The dynamic extending and shrinking of the batch farm
  - The worker node behaviour at large scale (software setup)
  - The image distribution system
  - The image provisioning systems

How we did it: initial limitations to protect the infrastructure
- No Lemon monitoring of the virtual machines
- No AFS support in the virtual machines (the rapid coming and going of clients is an issue with callbacks)
- Use of SSH keys to access the VMs instead of Kerberos (krb5)
- No attempt to Quattor-manage individual VMs

Lessons learned: general infrastructure
- DNS update rate exhausted when registering VMs
  - Throttle the creation/destruction rate (a sketch follows below)
- High LDAP query rate
  - Fixed batch worker node tools (pool cleaner)
  - Switched off quota on the VMs
  - May need more LDAP servers in the future
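A minimal sketch (not from the talk) of the kind of throttling that addresses the DNS update-rate problem: cap the rate at which VM create/destroy requests are issued so that registration can keep up. The `create_vm` call and the rate limit are placeholders, not the actual provisioning API.

```python
import time

# Hypothetical sketch of throttled VM creation. `create_vm` stands in for the
# real provisioning/registration call (e.g. via ONE or ISF); the limit is assumed.
MAX_REQUESTS_PER_MINUTE = 30   # tune to what DNS and the provisioning system can absorb

def create_vm(name):
    # Placeholder for the real request to instantiate and register one VM.
    print(f"requesting {name}")

def create_vms_throttled(names, max_per_minute=MAX_REQUESTS_PER_MINUTE):
    interval = 60.0 / max_per_minute
    for name in names:
        create_vm(name)
        time.sleep(interval)   # spread requests so DNS updates are not exhausted

create_vms_throttled([f"vm{i:05d}" for i in range(100)])
```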

Lessons learned: LSF itself
- Main issue: dynamic resizing of the farm
- Close collaboration with Platform Computing
  - Tests beyond what is officially supported
  - Several iterations with bug fixes on mbatchd, sbatchd and lim

Lessons learned: LSF software, main issues found/fixed
- mbatchd performance issue with dynamic hosts: internal reconfigurations slow down the system
- Some mbatchd crashes, specifically while removing hosts
- Internal limit of 9999 on the number of hosts in use
- Large memory footprint of mbatchd
- Will hopefully be addressed in future LSF versions

Lessons learned: LSF master hardware upgrade
- Started with a 24 GB RAM dual-Nehalem CPU server
  - mbatchd forking problems with >12,000 VMs
- Replaced by a 48 GB RAM dual-Nehalem CPU server
  - Enough to reach ~16,900 VMs and ~400,000 jobs

Lessons learned: LSF scalability test layout, the measurement grid
- Jobs: 0 - 400,000 in steps of 50,000
  - Use of job arrays of 1,000 jobs each (see the sketch below)
  - Short micro-benchmark jobs
  - Most tests done with dispatching enabled
- Worker nodes: 0 - 15,000, bin size ~2,500
  - Initially Xen, later Xen + KVM
  - ISF and ONE used as provisioning systems
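A minimal sketch (not the actual test harness) of how such a measurement grid can be filled using LSF job arrays: each submission adds a 1,000-element array of short benchmark jobs until one 50,000-job step is reached. The queue name and the benchmark payload are placeholders.

```python
import subprocess

# Hypothetical sketch: ramp the job count up in 1,000-job array steps using
# plain `bsub -J "name[1-1000]"` submissions. Queue and payload are assumptions.
QUEUE = "scaletest"           # placeholder queue name
PAYLOAD = "./micro_bench.sh"  # placeholder short micro-benchmark job

def submit_array(array_name, size=1000):
    # One LSF job array of `size` elements, submitted with a single bsub call.
    subprocess.run(
        ["bsub", "-q", QUEUE, "-J", f"{array_name}[1-{size}]", PAYLOAD],
        check=True,
    )

def ramp_to(total_jobs, step=1000):
    for i in range(total_jobs // step):
        submit_array(f"scale_{i:04d}", size=step)

ramp_to(50_000)   # one 50,000-job step of the measurement grid
```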

Lessons learned: LSF scalability test layout, probing the system
- Data acquisition from a monitoring script which:
  - Ran on the master and some clients in a loop (a sketch of such a loop is shown below)
  - Recorded:
    - the time stamp
    - the number of nodes and the number of jobs
    - the average LSF communication response time
    - the average LSF batch command response time
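A minimal sketch (not the actual script) of such a monitoring loop: it times a communication-layer command (`lsid`) and a batch-layer command (`bjobs`), counts hosts and jobs from `bhosts` and `bjobs` output, and appends one record per iteration. The output file name and probe interval are arbitrary choices.

```python
import csv
import subprocess
import time

def timed(cmd):
    # Run an LSF command, return (elapsed seconds, stdout).
    # Error handling is left out to keep the sketch short.
    t0 = time.time()
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return time.time() - t0, out

with open("lsf_scaling_probe.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        t_lsid, _ = timed(["lsid"])                        # communication layer
        t_bjobs, jobs_out = timed(["bjobs", "-u", "all"])  # batch system layer
        _, hosts_out = timed(["bhosts"])
        n_hosts = max(len(hosts_out.splitlines()) - 1, 0)  # minus the header line
        n_jobs = max(len(jobs_out.splitlines()) - 1, 0)
        writer.writerow([int(time.time()), n_hosts, n_jobs, t_lsid, t_bjobs])
        f.flush()
        time.sleep(60)   # probe once per minute (arbitrary choice)
```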

Lessons learned
[Figure: time evolution of the number of nodes, total jobs, running jobs and response time during the tests]

Lessons learned: LSF communication layer, lsid (client view)
- Behaves as expected
- Fast response time
- Some increase above ~12,500 VMs

Lessons learned: LSF communication layer, lshosts (client view)
- Returns a list of the hosts in the system
- Behaves as expected
- Small dependency on the number of jobs

Lessons learned: batch system layer, job submission with bsub (client view)
- Statistics + binning?
- The Tier-0 requires a submission time t < 1 s! (a timing sketch follows below)
- Issues above 10,000 nodes
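A minimal sketch (not from the slides) of checking the submission-latency requirement from a client: time a single `bsub` of a trivial command and compare against the 1 s target. The queue name is a placeholder.

```python
import subprocess
import time

QUEUE = "scaletest"   # placeholder queue name

# Time one trivial bsub submission and compare against the Tier-0 target of 1 s.
t0 = time.time()
subprocess.run(["bsub", "-q", QUEUE, "/bin/true"], capture_output=True, check=True)
elapsed = time.time() - t0

print(f"bsub took {elapsed:.2f} s -> "
      f"{'OK' if elapsed < 1.0 else 'too slow for Tier-0 use'}")
```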

Lessons learned: batch system layer, job queries (client view)
- Increases with the number of jobs
- Increases with the number of worker nodes

Lessons learned: scheduling performance, average job slot usage
- Plotted quantity: ratio of running jobs to job slots
- Above 10,000 nodes, up to 4%-5% of the job slots are empty (see the sketch below for how this ratio can be derived)
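A minimal sketch (not from the talk) of deriving such a slot-usage figure from standard LSF commands: sum the running jobs and the maximum job slots reported per host by `bhosts` and take the ratio. Column positions follow the default `bhosts` output (HOST_NAME, STATUS, JL/U, MAX, NJOBS, RUN, ...).

```python
import subprocess

# Hypothetical sketch: compute average job slot usage from default `bhosts` output.
out = subprocess.run(["bhosts"], capture_output=True, text=True).stdout

total_slots = 0
running = 0
for line in out.splitlines()[1:]:          # skip the header line
    cols = line.split()
    if len(cols) < 6 or cols[3] == "-":    # skip hosts with no slot limit set
        continue
    total_slots += int(cols[3])            # MAX job slots on this host
    running += int(cols[5])                # RUN jobs on this host

if total_slots:
    usage = running / total_slots
    print(f"job slot usage: {usage:.1%} ({total_slots - running} empty slots)")
```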

Scheduling performance
[Figure: distributions of the ratio of running jobs to job slots, broadening from ~1% free to ~4% free job slots]

Summary
- Thanks to our IaaS infrastructure we have brought LSF to its limits:
  - Up to 20x the current number of jobs
  - Up to 4-5x the current number of worker nodes
- The results are consistent with the expectations from the vendor's statements
- The results have been made available to Platform Computing and discussed with them in detail