CASTOR2 Disk Cache Scheduling: LSF, Job Manager and Python Policies
Dennis Waldron, CERN / IT
Castor External Operation Face-to-Face Meeting, CNAF, October 29-31, 2007
CERN - IT Department, CH-1211 Genève 23, Switzerland

Outline

LSF limitations, pre v2.1.3:
– Resource Monitoring and Shared Memory
– LSF changes and New Scheduler Plugin
– Python Policies
v2.1.4:
– Scheduling Requirements/Problems
– Job Manager
Future Developments (v2.1.6 & v2.1.7)

LSF Limitations, pre v2.1.3 releases

What was killing us:
– The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities.
– LSF jobs remained in PSUSP after a timeout between the stager and rmmaster (#17153).
– Poor submission rates into LSF, ~10 jobs/second: half of the advertised LSF rate.
– RmMaster did not keep node status after a restart (#15832).
– Database latency between the LSF plugin (schmod_castor) and the stager DB resulted in poor scheduling performance.
These were just the start!!! Additional information available at:

Resource Monitoring and Shared Memory

In 2.1.3 the LSF plugin and the Resource Monitor (rmMasterDaemon) share a common area of memory for exchanging information between the two processes (see the sketch below):
– Advantage: access to monitoring information from inside the LSF plugin is now a pure memory operation on the scheduler machine (extremely fast!).
– Disadvantage: the rmMasterDaemon and LSF must operate on the same machine (no possibility for LSF failover).
Changes to daemons in 2.1.3:
– rmmaster became a pure submission daemon.
– rmMasterDaemon was introduced for collecting monitoring information.
– rmnode was replaced by rmNodeDaemon on all diskservers.
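To make the idea concrete, here is a minimal Python sketch of two processes exchanging monitoring figures through a named shared-memory segment. It is illustrative only: the real rmMasterDaemon and scheduler plugin are C++ components, and the segment name, record layout and field choice below are assumptions.

import struct
from multiprocessing import shared_memory

RECORD = struct.Struct("dddd")   # e.g. spaceFree, spaceTotal, load, readRate

def publish(name, space_free, space_total, load, read_rate):
    # Monitoring side: write the latest figures into a named shared segment.
    try:
        shm = shared_memory.SharedMemory(name=name, create=True, size=RECORD.size)
    except FileExistsError:
        shm = shared_memory.SharedMemory(name=name)
    RECORD.pack_into(shm.buf, 0, space_free, space_total, load, read_rate)
    return shm

def read(name):
    # Scheduler side: a pure memory read, no database round trip.
    shm = shared_memory.SharedMemory(name=name)
    values = RECORD.unpack_from(shm.buf, 0)
    shm.close()
    return values

if __name__ == "__main__":
    seg = publish("castor_mon_demo", 2.5e12, 1.0e13, 0.42, 35.0)
    print(read("castor_mon_demo"))   # (2.5e12, 1e13, 0.42, 35.0)
    seg.close()
    seg.unlink()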

Resource Monitoring Cont.

New monitoring information contains (see the sketch after this list):
– On diskservers: ram (total + free), memory (total + free), swap (total + free), load, status and adminStatus.
– For each filesystem: space (total + free), nbRead/ReadWrite/WriteStreams, read/writeRate, nbMigrators, nbRecallers, status and adminStatus.
Monitoring intervals:
– 1 minute for slow-moving info (total*, *status)
– 10 s for fast-moving info (*Streams, *rate, load)
Status can be Production, Draining or Down.
Admin status can be None, Force or Deleted:
– Set via rmAdminNode.
– Force prevents updates from monitoring.
– Deleted deletes it from the DB.
– Release allows moving back from Force to None.
By default, new diskservers are in status DOWN and admin status FORCE.
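As a sketch of what this monitoring record holds, the Python dataclasses below simply mirror the fields listed above. Class and field names are hypothetical; the real structures live in the C++ shared-memory area.

from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PRODUCTION = "Production"
    DRAINING = "Draining"
    DOWN = "Down"

class AdminStatus(Enum):
    NONE = "None"        # normal operation, monitoring updates applied
    FORCE = "Force"      # operator-pinned, monitoring updates ignored
    DELETED = "Deleted"  # removed from the DB

@dataclass
class FileSystemMetrics:
    # fast-moving (refreshed roughly every 10 s)
    nb_read_streams: int = 0
    nb_readwrite_streams: int = 0
    nb_write_streams: int = 0
    read_rate: float = 0.0
    write_rate: float = 0.0
    nb_migrators: int = 0
    nb_recallers: int = 0
    # slow-moving (refreshed roughly every minute)
    space_total: int = 0
    space_free: int = 0
    status: Status = Status.DOWN
    admin_status: AdminStatus = AdminStatus.FORCE

@dataclass
class DiskServerMetrics:
    ram_total: int = 0
    ram_free: int = 0
    memory_total: int = 0
    memory_free: int = 0
    swap_total: int = 0
    swap_free: int = 0
    load: float = 0.0
    status: Status = Status.DOWN                   # new diskservers start DOWN...
    admin_status: AdminStatus = AdminStatus.FORCE  # ...and FORCE, as stated above
    filesystems: dict = field(default_factory=dict)  # mount point -> FileSystemMetrics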

LSF Changes and New Scheduler Plugin

Added multiple LSF queues, one per svcclass:
– Not for technical reasons!!!
– Allows for user restrictions at queue level and better visualisation of jobs on a per-svcclass basis via bqueues.
Utilisation of External Scheduler options during job submission (sketched below):
– Recommended by LSF experts.
– Increased job submission from 10 to 14 jobs/second.
– Calls to LSF (mbatchd) from CASTOR2 components reduced from 6 to 1. As a result, queue limitations are no longer needed (not totally disappeared!!).
– Removed the need for message boxes, i.e. jobs are no longer suspended and resumed at submission time.
– Requires LSF_ENABLE_EXTSCHEDULER to be enabled in lsf.conf (on both the scheduler and rmmaster machines).
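A hedged sketch of what a submission into a per-svcclass queue with external scheduler options could look like from the outside. LSF's bsub accepts -q and -extsched, but the exact payload understood by schmod_castor is not shown in these slides, so the CASTOR[...] string and function below are assumptions.

import subprocess

def submit(svcclass, request_id, job_script):
    cmd = [
        "bsub",
        "-q", svcclass,                                   # one LSF queue per service class
        "-extsched", f"CASTOR[requestId={request_id}]",   # hypothetical payload for schmod_castor
        job_script,
    ]
    # With external scheduler options the request data travels inside the
    # submission itself, so message boxes and extra mbatchd calls are not needed.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)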

LSF Changes Cont.

Filesystem selection is now transferred between LSF and the job (stagerJob) via the SharedLSFResource (sketched below):
– The location of the SharedLSFResource can be defined in castor.conf.
– Can be a shared filesystem, e.g. NFS, or a web server.
Why is it needed?
– LSF is CPU-aware, not filesystem-aware.
– The LSF scheduler plugin has all the logic for filesystem selection based on monitoring information and policies.
– The final decision needs to be transferred between the plugin and the LSF execution host.
– Could have been LSF message boxes or the SharedLSFResource. Neither is great, but we select the lesser of two evils!
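A minimal sketch of the SharedLSFResource hand-off, assuming the shared area is a plain NFS-mounted directory. The path, file naming and record format are assumptions; the real location comes from castor.conf.

import os

SHARED_LSF_RESOURCE = "/shared/castor/lsf"   # in reality read from castor.conf

def record_decision(job_id, diskserver, filesystem):
    # Scheduler plugin side: persist the selected diskserver:filesystem pair.
    with open(os.path.join(SHARED_LSF_RESOURCE, str(job_id)), "w") as f:
        f.write(f"{diskserver}:{filesystem}\n")

def read_decision(job_id):
    # Execution host side (stagerJob): pick up the decision made by the plugin.
    with open(os.path.join(SHARED_LSF_RESOURCE, str(job_id))) as f:
        diskserver, filesystem = f.read().strip().split(":", 1)
    return diskserver, filesystem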

LSF Python Policies

Why?
– Filesystem selection has moved from the Stager DB to the plugin. The plugin must now take over its functionality.
– Scheduling needs to be sensitive to other, non-scheduled activity and respond accordingly.
The initial implementation was a basic equation with coefficients set in castor.conf (sketched below):
– Advantage: simplicity.
– Disadvantages: simplicity; every new internal release during testing required changes to this equation inside the code!!
We couldn't ask the operations team to make these changes at runtime, so another language was needed for defining policies. The winner was Python!
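For illustration, the first implementation's weighting looked roughly like a fixed linear combination of monitoring values. The exact formula and coefficient names are not given on the slide, so the sketch below is an assumption.

def filesystem_weight(free_space, total_space, load, nb_streams,
                      c_space=1.0, c_load=1.0, c_streams=1.0):
    # Higher is better. Changing the behaviour meant editing coefficients in
    # castor.conf or, worse, changing this formula in the code and re-releasing.
    free_fraction = free_space / max(total_space, 1)
    return c_space * free_fraction - c_load * load - c_streams * nb_streams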

Python Policies Cont.

Examples: /etc/castor/policies.py.example
Policies are defined at a per-svcclass level. Many underestimate their importance!
Real example: 15 diskservers, 6 LSF slots each, all slots occupied transferring 1.2 GB files in both read and write directions. Expected throughput per stream ~20 MB/s (optimal).
Problems:
– At 20 MB/s, migration and recall streams suffer.
– Migrations and recalls are unscheduled activities.
Solution:
– Define a policy which favours migration and recall streams by restricting user activity on the diskserver, allowing more resources (bandwidth, disk I/O) to be used by migrations and recalls (see the sketch below).
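A sketch of what such a migration/recall-friendly policy could look like in Python. The real policy interface is the one shipped in /etc/castor/policies.py.example; the function signature, threshold and weighting below are assumptions.

MAX_USER_STREAMS = 3   # hypothetical cap on user transfers per filesystem

def favour_migration_policy(read_streams, write_streams, readwrite_streams,
                            migrators, recallers, free_space, total_space):
    # Return None to reject the filesystem for user jobs, or a weight (higher = better).
    user_streams = read_streams + write_streams + readwrite_streams
    if (migrators + recallers) > 0 and user_streams >= MAX_USER_STREAMS:
        # Hold back user activity so migrations/recalls keep bandwidth and disk I/O.
        return None
    return free_space / max(total_space, 1) - 0.1 * user_streams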

LSF Limitations, pre v2.1.3 releases (revisited)

What was killing us, and what changed:
– The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities. Now: no message boxes, LSF calls reduced from 6 to 1.
– LSF jobs remained in PSUSP after a timeout between the stager and rmmaster (#17153).
– Poor submission rates into LSF, ~10 jobs/second, half of the advertised LSF rate. Now at 14 jobs/second.
– RmMaster did not keep node status after a restart (#15832). States are now stored in the Stager DB for persistence.
– Database latency between the LSF plugin (schmod_castor) and the stager DB resulted in poor scheduling performance. Addressed by the shared memory implementation.
These were just the start!!! Additional information available at:

Scheduling Requirements/Problems

– Job submission rates are still not at the advertised LSF rate of 20 jobs per second.
– Jobs remain in a PEND'ing status indefinitely in LSF if no resources exist to run them (#15841).
– Administrative actions such as bkills do not notify the client of a request termination (#26134).
– CASTOR cannot throttle requests if they exceed a certain amount (#18155) - the infamous LSF meltdown.
A daemon was needed to manage and monitor jobs whilst in LSF and take appropriate action where needed.

Job Manager - Improvements

The stager no longer communicates directly with the submission daemon:
– All communication is done via the DB, making the jobManager stateless.
– Two new statuses exist in the subrequest table: SUBREQUEST_READYSCHED (13) and SUBREQUEST_BEINGSCHED (14).
– No more timeouts between the stager and rmmaster resulting in duplicate submissions and rmmaster meltdowns.
Utilises a forked process pool for submitting jobs into LSF (sketched below):
– The previous rmmaster forked a process for each submission into LSF, which is expensive.
– The number of LSF-related processes is now restricted to 2x the number of submission processes.
– Improved submission rates from 14 to 18.5 jobs/second.
New functionality added to detect when a job has been terminated by an administrator (`bkill`) and notify the client of the job's termination:
– New error code: 'Job killed by service administrator'.
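A sketch of the forked-pool pattern using Python's multiprocessing. The real jobManager is a C++ daemon, so the pool size, function names and loop below are illustrative only.

from multiprocessing import Pool

NB_SUBMISSION_PROCESSES = 10   # hypothetical pool size

def submit_to_lsf(subrequest_id):
    # In CASTOR this would perform the actual LSF submission; here it just
    # returns the id to show the fan-out pattern.
    return subrequest_id

if __name__ == "__main__":
    pending = range(100)   # stand-in for subrequests picked up in READYSCHED
    with Pool(processes=NB_SUBMISSION_PROCESSES) as pool:
        # A fixed, reusable pool avoids forking one process per submission,
        # which is what made the old rmmaster expensive.
        for submitted in pool.imap_unordered(submit_to_lsf, pending):
            pass  # here the subrequest would be marked as submitted in the DB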

Job Manager - Improvements Cont.

Jobs can now be killed if they remain in LSF for too long in a PEND'ing status (sketched below):
– The timeout value can be defined on a per-svcclass basis.
– The user receives error code: 'Job timed out while waiting to be scheduled'.
Jobs whose resource requirements can no longer be satisfied can be terminated:
– Error code: 'All copies of this file are unavailable for now. Please retry later'.
– Must be enabled in castor.conf via the option JobManager/ResReqKill.
Multiple JobManagers can operate in parallel for a redundant, high-availability solution.
All known rmmaster-related bugs closed!
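A sketch of the PEND-timeout check described above. The per-svcclass timeout values and the job tuple layout are assumptions; the quoted error string is the one from the slide.

import time

PEND_TIMEOUTS = {"default": 3600, "t0export": 600}   # seconds, hypothetical values

def jobs_to_kill(jobs, now=None):
    # jobs: iterable of (job_id, svcclass, submit_time, state) tuples.
    now = now if now is not None else time.time()
    expired = []
    for job_id, svcclass, submit_time, state in jobs:
        timeout = PEND_TIMEOUTS.get(svcclass, PEND_TIMEOUTS["default"])
        if state == "PEND" and (now - submit_time) > timeout:
            # The client would receive: 'Job timed out while waiting to be scheduled'
            expired.append(job_id)
    return expired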

Future Developments

– Disk-2-Disk copy scheduling.
– Support for multiple rmMasterDaemons running in parallel on a single CASTOR2 instance.

Comments, questions?