BNL: ATLAS Computing 1 A Scalable and Resilient PanDA Service US ATLAS Computing Facility and Physics Application Group Brookhaven National Lab Presented by Dantong Yu

BNL: ATLAS Computing 2  Motivation and Outline
 Build a scalable and resilient PanDA service for ATLAS
  Support the ATLAS VOs and thousands of ATLAS users and jobs.
  Reliable, scalable, and high performance.
  Cost-effective and flexible deployment.
 A joint effort between the Physics Application Group and the RACF Grid Computing Group to deploy and operate every component of the PanDA system.
 In this talk:
  BNL PanDA architecture
  PanDA components
  PanDA hardware
  Required software infrastructure and Grid middleware
  Infrastructure and procedure to download and install the required RPMs
  Nagios-based PanDA monitoring systems
  Operation procedures
  Problems experienced

BNL: ATLAS Computing 3  BNL PanDA Architecture

BNL: ATLAS Computing 4  Reliable/High-Performance ATLAS Job Management Architecture (PanDA)
 [Architecture diagram: clients reach virtual services through an F5 server load-balancing switch (VIP), which rewrites the IP headers for the source and destination addresses (IP relay) and forwards the requests to the physical servers: PanDA servers, monitoring servers, AutoPilot, the PanDA DB, and the PanDA archive.]

BNL: ATLAS Computing 5  PanDA Components

BNL: ATLAS Computing 6  Production PanDA Components
 Production system
  Front-end load balancers
   The F5 switch provides load balancing and reliability.
   Its transparency allows flexible management of the heterogeneous services, with only minimal application-level configuration and coding needed to integrate with the smart switch.
  PanDA Monitoring Service, PanDA Server, and PanDA Logging Server (stateless)
   The server dispatches jobs to pilots as they request them; it is HTTPS-based and stateless, and needs to connect to the central PanDA DB (a sketch of this exchange follows this slide).
   The monitor provides graphical, read-only information about PanDA functions via HTTP. The GUI is also stateless and needs to connect to the central PanDA DB.
   The logger records PanDA server events in the PanDA DB.
  AutoPilot submission systems (stateful)
   Use Condor-G and the site gatekeepers to fill sites with pilots.
  PanDA pilot wrapper code distributor: Subversion with a web front end.
   Pilots dynamically download the pilot wrapper script from the Subversion web cache.
  PanDA database system
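To illustrate the stateless, HTTPS-based dispatch described above, here is a minimal Python sketch of a pilot asking the PanDA server for a job through the F5 virtual IP. The host name, URL path, parameter names, and response format below are assumptions made for the example, not the actual PanDA API.

    # Minimal sketch: a pilot requests a job from a stateless PanDA server over HTTPS.
    # Host name, URL path, parameter names, and response format are hypothetical.
    import ssl
    import urllib.parse
    import urllib.request

    PANDA_VIP = "https://pandasrv.example.bnl.gov:25443"  # placeholder virtual IP on the F5 switch

    def request_job(site_name, label):
        """POST a job request; any physical server behind the VIP may answer,
        because the server keeps no session state between requests."""
        payload = urllib.parse.urlencode({"siteName": site_name,
                                          "prodSourceLabel": label}).encode()
        req = urllib.request.Request(PANDA_VIP + "/server/panda/getJob", data=payload)
        ctx = ssl.create_default_context()  # TLS for the HTTPS exchange
        with urllib.request.urlopen(req, context=ctx, timeout=60) as resp:
            # Assume a simple key=value&key=value reply for the sketch.
            return dict(urllib.parse.parse_qsl(resp.read().decode()))

    job = request_job("BNL_ATLAS_1", "managed")
    print(job.get("StatusCode"), job.get("PandaID"))

Because any server behind the VIP can answer such a request, a failed node can simply be dropped from the pool without any client-side change.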

BNL: ATLAS Computing 7  PanDA Development and Testbed Systems
 PanDA testbed and development systems
  PanDA Monitoring Service and PanDA Server
  Database system

BNL: ATLAS Computing 8  PanDA Hardware

BNL: ATLAS Computing 9  PanDA Hardware
 Each component group requires a separate set of hosts and hardware. Most servers are standalone, except for a few of them.
 Front-end load balancers: two F5 load-balancing switches.
 PanDA Monitor, PanDA Server, and PanDA Logging Servers:
  Dual quad-core Intel Xeon CPUs at 2.00 GHz (eight cores per host), 16 GB memory, and six 750 GB SATA drives (software RAID 10 providing 2 TB of local storage): three servers.
 AutoPilot submission systems (local pilots and global pilots), stateful:
  Dual quad-core Intel Xeon CPUs at 2.66 GHz (eight cores per host), 16 GB memory, and two 750 GB SATA drives (mirrored disks): four servers.
 PanDA pilot wrapper code distributor: Subversion with a web front end.
  Dual quad-core Intel Xeon CPUs at 2.66 GHz (eight cores per host), 8 GB memory, and two 150 GB SAS drives (mirrored disks): one server. An archive system is needed for recovery if the disk storage is lost.
 Web Apache server

BNL: ATLAS Computing 10  BNL ATLAS MySQL Production and Development Servers
 The following BNL production MySQL servers are used:
  2 PanDA production MySQL servers (InnoDB): primary and spare, dual dual-core with 16 GB memory and a 64-bit OS.
  4 PanDA archive MySQL servers (MyISAM): 2 primary + 2 spare, 2 quad-core processors with 16 GB memory and a 64-bit OS.
 Daily text-based backup (database content) of all databases on the production servers above, with an extra disk copy on a dedicated data server that has an interface to the tape system (a sketch of such a nightly dump follows this slide).
 64-bit architecture, x86_64, ELsmp.
 Six 15k rpm SAS drives, each with 145 GB of disk space.
 Details can be found at
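As a rough illustration of the daily text-based backup, the following Python sketch wraps mysqldump in the kind of job a nightly cron entry could run; the host names, output directory, and credential handling are placeholders, not the production configuration.

    # Minimal sketch of a nightly full text dump of the MySQL servers.
    # Server names and the backup directory are hypothetical placeholders;
    # credentials are assumed to come from a protected MySQL options file.
    import datetime
    import subprocess

    BACKUP_DIR = "/backup/mysql"  # staging area later copied to the tape-connected data server
    SERVERS = ["pandadb01.example.bnl.gov", "pandaarch01.example.bnl.gov"]

    def dump_server(host):
        """Write a full text dump of one server and return the output file name."""
        stamp = datetime.date.today().isoformat()
        outfile = "%s/%s-%s.sql" % (BACKUP_DIR, host, stamp)
        subprocess.run(["mysqldump", "--host", host, "--all-databases",
                        "--single-transaction", "--result-file", outfile],
                       check=True)  # raise if the dump fails so cron reports the error
        return outfile

    for server in SERVERS:
        print("dumped", dump_server(server))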

BNL: ATLAS Computing 11  ATLAS MySQL Production Databases at BNL: Details and Performance
 PanDA production MySQL server and its replica server with identical hardware: the "fast buffer" database.
  Keeps information about all PanDA-managed reprocessing, MC production, and user analysis jobs for up to two weeks; a cron job periodically moves the data into the archive (see the sketch after this slide).
  Designed initially for US ATLAS; since September 2007 it supports 10 different ATLAS clouds (CERN, CA, DE, ES, FR, NL, UK, US, TW, and 2 instances for NorduGrid: ND, NDGF).
  Runs MySQL version 5.0.X.
  InnoDB engine, simple structure, auto-increment for IDs, no foreign keys.
  31 tables, maximum number of rows ~16,500,000.
  Provides fast, multiple parallel connections to the basic PanDA components: PanDA server, PanDA monitor, and logger.
 Performance
  Access pattern: ~ parallel threads open simultaneously all the time (max ~600).
  Performance: average ~360 queries/sec (max > 800).
  Query types: select ~35%, update ~35%, insert ~25%, others (delete, etc.) ~5%.
  Nice monitoring interface, the PanDA monitor:
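The cron job that drains the two-week "fast buffer" into the archive can be pictured with the following sketch; the table names, column names, hosts, and credentials are hypothetical stand-ins for the real schema and configuration.

    # Minimal sketch of the PandaDB -> PandaArchiveDB mover run periodically from cron.
    # Table names, column names, hosts, and credentials are illustrative only.
    import datetime
    import MySQLdb  # from the MySQL-python RPM listed later in the talk

    CUTOFF = datetime.datetime.now() - datetime.timedelta(days=14)

    src = MySQLdb.connect(host="pandadb.example.bnl.gov", db="PandaDB",
                          user="archiver", passwd="XXXX")
    dst = MySQLdb.connect(host="pandaarch.example.bnl.gov", db="PandaArchiveDB",
                          user="archiver", passwd="XXXX")
    rcur, wcur = src.cursor(), dst.cursor()

    # Copy jobs older than two weeks into the archive, then delete them from the buffer.
    rcur.execute("SELECT PandaID, jobStatus, modificationTime FROM jobsActive "
                 "WHERE modificationTime < %s", (CUTOFF,))
    rows = rcur.fetchall()
    if rows:
        wcur.executemany("INSERT INTO jobsArchived (PandaID, jobStatus, modificationTime) "
                         "VALUES (%s, %s, %s)", rows)
        dst.commit()
        rcur.executemany("DELETE FROM jobsActive WHERE PandaID = %s",
                         [(r[0],) for r in rows])
        src.commit()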

BNL: ATLAS Computing 12  Critical DBs on the Four PanDA Archival Database Servers
 PanDA Archive production MySQL server (along with a spare node)
 Database PandaArchiveDB
  Keeps the full archive of PanDA-managed reprocessing, Monte Carlo production, and user analysis jobs since the end of
  MyISAM engine, no auto-increment, replication from PandaDB through cron jobs.
  Partitioning: bi-monthly structure of the job/file archive tables for better search performance (see the sketch after this slide).
  44 tables, maximum number of rows ~33,000,000 per table.
 Databases PandaLogDB and PandaMetaDB
  Keep the archive of log-extract files for jobs, some monitoring information about pilots, and the autopilot and scheduler configuration (schedconfig).
  MyISAM engine, ~52-54 tables.
  Partitioning: bi-monthly structure for some tables.
  Maximum number of rows ~4,600,000 per table.
  Access pattern: ~ parallel threads open (max ~740).
  Performance: average ~ queries/sec (max ~2800); select ~80%, insert ~20%.
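The bi-monthly partitioning can be illustrated with a small helper that maps a date to the archive table covering it; the jobsArchived_&lt;year&gt;_&lt;period&gt; naming scheme is a made-up example of the idea, not the actual table names.

    # Minimal sketch: pick the two-month archive table that covers a given date,
    # so a narrow time window only touches one small table. The naming scheme is hypothetical.
    import datetime

    def archive_table(when):
        period = (when.month - 1) // 2 + 1  # Jan/Feb -> 1, Mar/Apr -> 2, ..., Nov/Dec -> 6
        return "jobsArchived_%d_%d" % (when.year, period)

    print(archive_table(datetime.date(2008, 9, 15)))  # -> jobsArchived_2008_5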

BNL: ATLAS Computing 13  PanDA Server Infrastructure

BNL: ATLAS Computing 14  PanDA Software Infrastructure
 OS, Grid middleware, and software requirements
  OS (RHEL/SL 4) RPMs: mod_ssl, subversion, rrdtool, openmpi, gridsite, graphtool, matplotlib, MySQL.
  gLite-UI 3.1: set up from /etc/profile.d/.
  CA certificates installed and kept up to date.
  Unix accounts with ssh-key access: sm
  Python 2.5 (from Tadashi) RPMs: python25, mod_python25, python25-curl, python25-numeric, MySQL-python25, python25-imaging.

BNL: ATLAS Computing 15  PanDA AutoPilot
 gLite-UI 3.1: set up from /etc/profile.d/
 CA certificates installed and kept up to date
 Unix accounts with ssh-key access: sm (sm2 for grid autopilot, usatlas1 for local submission)
 Condor with a custom configuration (a sketch of a pilot submission through Condor-G follows this slide)
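To show how an autopilot can keep a site filled with pilots via Condor-G, here is a sketch that writes a grid-universe submit description and hands it to condor_submit. The gatekeeper name, wrapper path, and pilot count are assumptions, and the real AutoPilot is considerably more elaborate.

    # Minimal sketch of submitting a batch of pilots through Condor-G.
    # Gatekeeper, wrapper path, and pilot count are hypothetical placeholders.
    import subprocess
    import tempfile
    import textwrap

    GATEKEEPER = "gridgk01.example.bnl.gov/jobmanager-condor"  # placeholder site gatekeeper
    WRAPPER = "/opt/panda/pilot_wrapper.sh"  # wrapper fetched from the Subversion web cache
    N_PILOTS = 20

    submit_template = textwrap.dedent("""\
        universe      = grid
        grid_resource = gt2 {gk}
        executable    = {wrapper}
        output        = pilot.$(Cluster).$(Process).out
        error         = pilot.$(Cluster).$(Process).err
        log           = pilot.log
        queue {n}
        """)
    submit_description = submit_template.format(gk=GATEKEEPER, wrapper=WRAPPER, n=N_PILOTS)

    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(submit_description)
        subfile = f.name

    # Assumes the grid proxy for the submitting account (sm/sm2) is already in place.
    subprocess.run(["condor_submit", subfile], check=True)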

BNL: ATLAS Computing 16  PanDA System OS Administration
 Initial install: a semi-manual setup script is at /afs/usatlas.bnl.gov/mgmt/etc/gridui.usatlas.bnl.gov/system-setup.sh
 Ongoing package maintenance: BNL Red Hat Satellite system.
 Condor administration: on systems with global Condor, configuration changes and restarts require root.
 Account management: occasional SSH key additions for new team members.

BNL: ATLAS Computing 17  PanDA Monitoring Systems

BNL: ATLAS Computing 18 Panda Monitoring Systems

BNL: ATLAS Computing 19 USATLAS Tier 2 Sites

BNL: ATLAS Computing 20  MySQL Server Monitoring
 We use three monitoring tools for the MySQL servers:
 - MySQLStat: provides a monitoring service for the internal ATLAS community: BNL ATLAS MySQL servers, CERN MySQL servers, and some other MySQL servers in the USA and Europe.
 - Ganglia
 - Nagios: provides critical server status, sends warnings and alarms if a service has a problem, opens RT tickets, and can perform some simple automatic recovery (a sketch of a Nagios-style check follows this slide).
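The Nagios checks can be pictured with a small plugin-style script following the standard Nagios convention of printing one status line and exiting 0, 1, or 2 for OK, WARNING, or CRITICAL; the host, account, and thresholds are illustrative assumptions, not the production check.

    # Minimal sketch of a Nagios-style check of a MySQL server's connection count.
    # Host, credentials, and thresholds are hypothetical placeholders.
    import sys
    import MySQLdb

    HOST = "pandadb.example.bnl.gov"
    WARN_THREADS, CRIT_THREADS = 400, 600  # warn/critical limits on open connections

    try:
        conn = MySQLdb.connect(host=HOST, user="nagios", passwd="XXXX", connect_timeout=10)
        cur = conn.cursor()
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        threads = int(cur.fetchone()[1])
    except Exception as exc:
        print("CRITICAL - cannot query %s: %s" % (HOST, exc))
        sys.exit(2)

    if threads >= CRIT_THREADS:
        print("CRITICAL - %d connections open" % threads)
        sys.exit(2)
    if threads >= WARN_THREADS:
        print("WARNING - %d connections open" % threads)
        sys.exit(1)
    print("OK - %d connections open" % threads)
    sys.exit(0)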

BNL: ATLAS Computing 21 MySQL Servers Monitoring: MySQLstat

BNL: ATLAS Computing 22  PanDA Operation Procedure

BNL: ATLAS Computing 23
 [Workflow diagram: machines and services monitored by Nagios; Nagios feeds the RACF SLA system, which opens RT tickets; RT exchanges problem reports with OSG Footprints and GGUS.]
 In case of a failure of a critical machine or service, Nagios generates alarms and sends them to the SLA system. When the service recovers, Nagios notifies the SLA system again.
 The RACF SLA system provides a configurable alarm management layer that automates service alerts from the Nagios-based monitoring system.
 RT can exchange problem reports with external ticketing systems (OSG Footprints, GGUS).
 Tickets are escalated if no response happens within the SLA-specified time window (a sketch of this escalation logic follows this slide).
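The escalation rule at the end of that workflow can be sketched as follows; the SLA window and the ticket object are stand-ins for the real RACF SLA system, RT, OSG Footprints, and GGUS interfaces.

    # Minimal sketch of the alarm-handling idea: a Nagios alert opens a ticket,
    # and an unanswered ticket is escalated once the SLA window expires.
    # The Ticket class and the four-hour window are hypothetical placeholders.
    import datetime

    SLA_WINDOW = datetime.timedelta(hours=4)

    class Ticket:
        def __init__(self, service, message):
            self.service = service
            self.message = message
            self.opened = datetime.datetime.now()
            self.acknowledged = False

    def handle_alarm(service, message):
        """Called when Nagios reports a critical machine or service failure."""
        print("RT: opening ticket for %s: %s" % (service, message))
        return Ticket(service, message)

    def escalate_if_stale(ticket):
        """Periodic check: escalate tickets still unanswered past the SLA window."""
        age = datetime.datetime.now() - ticket.opened
        if not ticket.acknowledged and age > SLA_WINDOW:
            print("ESCALATION: %s unanswered for %s, notifying GGUS/on-call" % (ticket.service, age))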

BNL: ATLAS Computing 24 Experienced Operation Problems

BNL: ATLAS Computing 25  PanDA Server and Database Problems
 PanDA server hanging
  A cron job on the database server detects slow queries and disconnects the PanDA server's MySQL connection if it appears to be slow (see the sketch after this slide).
  The PanDA processes do not handle this disconnection and end up frozen.
  The PanDA server then has to be restarted, either manually or automatically by Nagios.
 PanDA database server load
  Enhanced the database monitoring capabilities, identified intrusive queries and the particular users and applications that initiate them, and worked with users to modify their MySQL queries.
  This effectively and significantly reduced the number of slow queries.
  Purchased licensed MySQL backup software to reduce the backup time.
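The database-side cron job mentioned in the first bullet can be sketched as a watchdog over the MySQL process list; the threshold, host, and account are placeholders, and in practice the PanDA server has to tolerate the resulting disconnect, which is exactly the failure mode described above.

    # Minimal sketch of a slow-query watchdog run from cron on the database server.
    # Threshold, host, and credentials are hypothetical placeholders.
    import MySQLdb

    MAX_SECONDS = 300  # assumed slow-query threshold

    conn = MySQLdb.connect(host="pandadb.example.bnl.gov", user="watchdog", passwd="XXXX")
    cur = conn.cursor()
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        thread_id, user, host, db, command, seconds, state, info = row[:8]
        if command == "Query" and seconds is not None and seconds > MAX_SECONDS:
            print("killing thread %s (%s@%s), running %ss: %s" % (thread_id, user, host, seconds, info))
            cur.execute("KILL %s" % int(thread_id))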

BNL: ATLAS Computing 26  Condor-G Based AutoPilots
 Condor-G and the gatekeepers use GASS servers to synchronize job status; a large number of Condor-G jobs adds a significant load and results in status loss and held jobs.
  Condor-G frequently froze because of the large number of held jobs.
  The pilot job status reported by Condor-G got out of sync with the actual status of the ATLAS jobs.
  Killing held pilot jobs aborted good ATLAS jobs prematurely.
 Worked with the Univ. of Wisconsin to customize Condor-G:
  Stage-in and stage-out events written to the user log for better diagnosis.
  More Condor-G tuning options for submitting and dispatching large numbers of jobs.
  Finer-grained tuning knobs with separate throttles, for example limiting jobmanagers by their role: submission vs. stage-out/removal.
  Efficiently process failed jobs and prevent bad jobs from clogging the submission system: when a gridmanager decides to put a job on hold, instead use the hold_reason as the abort_reason and abort the job.

BNL: ATLAS Computing 27  PanDA Monitoring
 The front-end switch system hung because of expired licenses.
 The new Python version and the Oracle clients require a manual compile.
 The certificate authority does not issue certificates with a DN containing a wildcard (*), so clients could not properly perform X.509 certificate-based authentication against multiple backend servers behind the F5 switch.

BNL: ATLAS Computing 28  Summary
 Contributions:
  Innovation in hardware resilience, extensive monitoring, and automatic problem reporting and tracking.
  Significantly enhanced the reliability of the evolving PanDA system.
  Support easy access to the system for software improvement.
 Condor-G is slow to update pilot status, causing inconsistency between the actual job status and PanDA monitoring.
 Frequent crashes of Condor-G components: fixed after the Condor team provided condor