“Managing a farm without user jobs would be easier”
Clusters and Users at CERN
Tim Smith, CERN/IT


2002/10/25 HEPiX fall 2002

Contents
- The road to shared clusters
- Batch cluster
- Configuration
- User challenges
- Addressing the challenges
- Interactive cluster
- Load balancing
- Conclusions

The Demise of Free Choice

Cluster Aggregation

Organisational Compromises
- Clusters per group
  - Sized for the average
    - users
  - Sized for user peaks
    - users
    - financiers: wasted resources
  - Invest effort in recuperating cycles for other groups
  - Configuration differences / specialities
- Bulk production clusters
  - Production fluctuations dwarf those in user analysis
  - Complex cross-submission links

Production Farm: Planning

Shared Clusters
[Diagram. Node labels: lxplus001, lxbatch001, disk001, tape001. Link labels: DNS load balancing, LSF, rfio. Server counts as extracted: Batch Servers 70 Interactive Servers 120 Disk Servers]

Simple, Uniform Shared Cluster?

- Partitioning
  - Still have identified resources
  - Uniform configuration
- Sharing
  - Repartitioning or soak-up queues
  - If the owner experiment reclaims resources, soak-up jobs must be suspended, leaving stranded jobs

[Figure labels: ALICE, ATLAS, CMS, LHCb, ALEPH, DELPHI, L3, OPAL, COMPASS, Ntof, OPERA, SLAP, PARC, PARC Int, CVSBUILD, DELPHI Int, CSF, Public]

LSF Fair-Share
- Trade in partition for a share
- Multilevel
  - ATLAS 10%, CMS 12%, …
  - cmsprod 45%, HiggsWG 15%, …
  - usera 10%, userb 80%, userc 10%
- Extra shares for productions
- Effort: juggling resources to accounting
  - Demonstrating fairness
  - Protecting
  - Policing
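
As a rough illustration of what a multilevel share means, the sketch below multiplies the percentages quoted on the slide down one path of the hierarchy. The nesting used here (usera, userb and userc sitting under HiggsWG, which sits under CMS) is an assumption made for the example; the slide does not say which group those users belong to. The function name `effective_share` is hypothetical, and the result is only the static entitlement: LSF also takes recent usage into account when ordering jobs, which this sketch ignores.

```python
# Shares copied from the slide. The grouping into three levels and the
# assumption that the users belong to HiggsWG are illustrative only.
SHARES = {
    "experiment": {"ATLAS": 0.10, "CMS": 0.12},
    "group": {"cmsprod": 0.45, "HiggsWG": 0.15},
    "user": {"usera": 0.10, "userb": 0.80, "userc": 0.10},
}


def effective_share(experiment, group, user):
    """Fraction of the whole cluster a user is entitled to.

    Each level's share is a fraction of its parent's share, so the
    entitlement is the product of the shares along the path.
    """
    return (
        SHARES["experiment"][experiment]
        * SHARES["group"][group]
        * SHARES["user"][user]
    )


if __name__ == "__main__":
    share = effective_share("CMS", "HiggsWG", "userb")
    print(f"userb in HiggsWG (CMS): {share:.2%} of the cluster")
```

Under these assumptions a user holding 80% of a 15% group share inside a 12% experiment share is entitled to about 1.4% of the cluster, which gives a feel for why demonstrating fairness needs accounting.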

Facts and Figures
- Accounting
  - LSF job records
  - Process with C program
  - Load into Oracle DB
  - Prepare plots/tables with Crystal Reports package
  - LSFAnalyser?
- Monitoring
  - Poll the user access tools
  - SiteAssure?
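
The aggregation step of such an accounting chain can be sketched in a few lines. This is not the C program or the Oracle schema mentioned on the slide, neither of which is described there. The record layout `(group, finish date, CPU seconds)` is a hypothetical simplification rather than the real LSF job record format, and `cpu_hours_per_week` is a name invented for the example.

```python
from collections import defaultdict
from datetime import date


def cpu_hours_per_week(records):
    """Sum CPU hours per (ISO year, ISO week, group).

    `records` is an iterable of (group, finish_date, cpu_seconds) tuples,
    a simplified stand-in for parsed batch job records.
    """
    totals = defaultdict(float)
    for group, finished, cpu_seconds in records:
        iso_year, iso_week, _weekday = finished.isocalendar()
        totals[(iso_year, iso_week, group)] += cpu_seconds / 3600.0
    return dict(totals)


# Made-up sample records, for illustration only.
SAMPLE_RECORDS = [
    ("cms", date(2002, 10, 21), 7200),
    ("cms", date(2002, 10, 25), 3600),
    ("atlas", date(2002, 10, 25), 1800),
]

if __name__ == "__main__":
    for key, hours in sorted(cpu_hours_per_week(SAMPLE_RECORDS).items()):
        print(key, f"{hours:.1f} CPU hours")
```

Totals keyed by week and group are the kind of table a CPU time per week plot would be drawn from.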

CPU Time / Week
Merged user analysis and production farms

Performance of Batch Job Slot Analysis
[Plot; time axis Thu, Fri, Sat, 10 min / tick]

Challenging Batch (I)
- Probing boundaries
  - Flooding
  - Concurrent starts
  - Uncontrolled status polling
- Hitting limits
  - Disk space: /tmp, /pool, /var
  - Memory, swap full
  - Guarantees for other user jobs?
- System issues
  - Queue drainers

Challenging Batch (II)
- Un-Fair-Share
  - Logging onto batch machines
  - Batch jobs which resubmit themselves
  - Forking sessions back to remote hosts
- Wasting resources
  - Spawning processes which outlive the jobs
  - Sleeping processes
  - Copying large AFS trees
  - Establishing connections to dead machines

Counter Measures
- File system quotas
- Virtual memory limits
- Concurrent job limits per user/group
- Restricted access through PAM
- Instant response queues
- Master node setup
  - Dedicated, 1 GB memory
  - Failover cluster
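
The concurrent job limit is the easiest of these counter measures to picture as logic. A batch system enforces it through its own configuration rather than through user-visible code, so the sketch below only illustrates the rule. The function `may_start`, its parameters and the limit values are all hypothetical.

```python
from collections import Counter


def may_start(job_user, job_group, running, user_limit, group_limit):
    """Return True if one more job keeps both user and group within limits.

    `running` is a list of (user, group) pairs, one per running job.
    """
    per_user = Counter(user for user, _group in running)
    per_group = Counter(group for _user, group in running)
    return per_user[job_user] < user_limit and per_group[job_group] < group_limit


if __name__ == "__main__":
    running = [("usera", "cms"), ("usera", "cms"), ("userb", "cms")]
    # usera already runs 2 jobs, so a per-user limit of 2 holds the next one.
    print(may_start("usera", "cms", running, user_limit=2, group_limit=10))
    # userb runs 1 job and the group is well below its limit.
    print(may_start("userb", "cms", running, user_limit=2, group_limit=10))
```

A job that fails the check simply stays pending, which blunts flooding without rejecting the submission.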

Shared Clusters
[Same diagram as the earlier Shared Clusters slide, with the added label: LSF MultiCluster]

Shared Clusters
[Same diagram as the earlier Shared Clusters slide, with the added label: Single Cluster]

Interactive Cluster
- DNS load balancing (ISS)
- Weighted load indexes
  - load, memory
  - swap rate, disk IO rate
  - # processes, # sessions, # window mgr sessions
- Exclusion thresholds
  - file systems full, nologins
- DNS publish 2 every 30 seconds
  - Random from lowest 5
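
One reading of the selection rule above is: drop nodes that trip an exclusion threshold, rank the rest by a weighted load index, and publish two addresses drawn at random from the five least loaded. The sketch below follows that reading. It is not the ISS implementation; the weights, metric names, `Node` class and `pick_for_dns` function are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    metrics: dict            # e.g. {"load": 1.2, "sessions": 12}
    fs_full: bool = False    # exclusion: a file system is full
    nologin: bool = False    # exclusion: logins are disabled


# Hypothetical weights; the slide lists the metrics but not their weights.
WEIGHTS = {
    "load": 1.0,
    "memory": 0.5,
    "swap_rate": 0.5,
    "disk_io_rate": 0.5,
    "processes": 0.01,
    "sessions": 0.05,
    "wm_sessions": 0.1,
}


def load_index(node):
    """Weighted sum of the node's metrics; missing metrics count as 0."""
    return sum(w * node.metrics.get(name, 0.0) for name, w in WEIGHTS.items())


def pick_for_dns(nodes, pool_size=5, publish=2, rng=random):
    """Choose `publish` nodes at random from the `pool_size` least loaded."""
    eligible = [n for n in nodes if not (n.fs_full or n.nologin)]
    lowest = sorted(eligible, key=load_index)[:pool_size]
    return rng.sample(lowest, min(publish, len(lowest)))


if __name__ == "__main__":
    farm = [Node(f"lxplus{i:03d}", {"load": float(i)}) for i in range(1, 9)]
    farm[0].nologin = True  # excluded even though it has the lowest load
    print([n.name for n in pick_for_dns(farm)])
```

Drawing at random from a small pool, rather than always publishing the single least loaded node, avoids sending every new login in a 30-second window to the same machine.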

Daily Users
35 users / node

Challenging Interactive
- Sidestep load balancing
  - Parallel sessions across farm
- Running daemons
- Brutal logouts
  - Open connections
  - Defunct processes
  - CPU-sapping orphaned processes
- Monitoring + beniced + monthly reboots
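
Spotting CPU-sapping orphans can be sketched as a filter over a process table. The slides do not describe how the monitoring or beniced work, so this is only an illustration under stated assumptions: orphans are taken to be user processes re-parented to PID 1, and the `Proc` record, the `SYSTEM_USERS` set and the 50% CPU threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    user: str
    cpu_percent: float


# Hypothetical list of accounts whose PID-1 children are legitimate daemons.
SYSTEM_USERS = {"root", "daemon"}


def cpu_sapping_orphans(procs, cpu_threshold=50.0):
    """User processes whose parent is PID 1 and that still burn CPU.

    After a brutal logout the session's children are typically
    re-parented to init, which is what the ppid == 1 test looks for.
    """
    return [
        p
        for p in procs
        if p.ppid == 1
        and p.user not in SYSTEM_USERS
        and p.cpu_percent >= cpu_threshold
    ]


if __name__ == "__main__":
    table = [
        Proc(pid=1, ppid=0, user="root", cpu_percent=0.0),
        Proc(pid=50, ppid=1, user="root", cpu_percent=80.0),    # system daemon
        Proc(pid=101, ppid=1, user="usera", cpu_percent=99.0),  # orphan, busy
        Proc(pid=102, ppid=1, user="userb", cpu_percent=0.1),   # orphan, idle
        Proc(pid=103, ppid=90, user="usera", cpu_percent=95.0), # live session
    ]
    print([p.pid for p in cpu_sapping_orphans(table)])
```

What happens to the flagged processes (lowering their priority, notifying the owner, or waiting for the monthly reboot) is a policy choice outside this sketch.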

Interactive Reboots

Conclusions
- Shared clusters present more user opportunities
  - Both good and bad!
- Don’t represent a panacea for sysadmins!