Initial Charge
On November 10, 2004, Vicky White (Fermilab CD Head) wrote the following:
In order to better serve the entire program of the laboratory the Computing Division will place all of its production resources in a Grid infrastructure called FermiGrid. This strategy will continue to allow the large experiments who currently have dedicated resources to have first priority usage of certain resources that are purchased on their behalf. It will allow access to these dedicated resources, as well as other shared Farm and Analysis resources, for opportunistic use by various Virtual Organizations (VOs) that participate in FermiGrid (i.e. all of our lab programs) and by certain VOs that use the Open Science Grid. (Add something about prioritization and scheduling – lab/CD – new forums).
The strategy will allow us:
– to optimize use of resources at Fermilab
– to make a coherent way of putting Fermilab on the Open Science Grid
– to save some effort and resources by implementing certain shared services and approaches
– to work together more coherently to move all of our applications and services to run on the Grid
– to better handle a transition from Run II to LHC (and eventually to BTeV) in a time of shrinking budgets and possibly shrinking resources for Run II worldwide
– to fully support Open Science Grid and the LHC Computing Grid and gain positive benefit from this emerging infrastructure in the US and Europe.
02-Feb-2010 Fermilab Campus Grid 2
Initial 4 components
– Common Grid Services: Supporting common Grid services to aid in the development and deployment of Grid computing infrastructure by the supported experiments at FNAL.
– Stakeholder Bilateral Interoperability: Facilitating the shared use of central and experiment controlled computing facilities by supported experiments at FNAL (CDF, D0, CMS, GP Farms).
– Deployment of OSG Interfaces for Fermilab: Enabling the opportunistic use of FNAL computing resources through Open Science Grid (OSG) interfaces.
– Exposure of the Permanent Storage System: Enabling the opportunistic use of FNAL storage resources (STKEN) through Open Science Grid (OSG) interfaces.
Issues and Challenges
– UID/GIDs: Coordinated site wide
– Site Wide NFS: BlueArc NFS Server Appliance
  – A good product, but it does have limitations (which we have found)!
– Siloed Infrastructures (in early 2005):
  – CDF: 2 clusters, condor, total 3,000 slots
  – D0: 2 clusters, pbs, total 3,000 slots
  – CMS: 1 cluster, condor, 5,000 slots
  – GP Farm: 1 cluster, condor, 1,000 slots
– Critical services: Figure out how to deliver highly available services
– Compute Intensive vs. I/O Intensive Analysis: Compute Grid vs. Data Grid.
– Preemption
– Planning for the Future
Timeline (Past / Future)
– Charge from CD Head: 10-Nov-2004
– Equipment ordered: 15-Feb-2005
– 1st core services (VOMS, GUMS) commissioned: 1-Apr-2005
– GP Cluster transitioned to Grid interface: Apr-2005
– D0 Clusters transitioned to GUMS: May-2005
– FermiGrid Web Page (http://fermigrid.fnal.gov): 25-May-2005
– Site Gateway: 26-Aug-2005
– Metrics collection (start): 26-Aug-2005
– Active service monitoring (start): 26-Jul-2006
– Site AuthoriZation (SAZ) Service: 1-Oct-2006
– Demonstrated full stakeholder interoperability: 14-Feb-2007
– High Availability (VOMS-HA, GUMS-HA, SAZ-HA) Authorization Services: 3-Dec-2007
– Virtualized Services (Squid-HA, Ganglia-HA, MyProxy-VM, Gatekeeper-VM, Condor-VM): Apr to Jun 2008
– Virtualized Gratia Service (VM but not HA): mid 2009
– High Availability Resource Selection Service (ReSS-HA): 01-Oct-2009
– MyProxy-HA: 22-Jan-2010
– (today: 02-Feb-2010)
– Gratia-HP/HA: (early) Feb-2010
– Gatekeeper-HA & NFSlite: Apr-2010
– FermiCloud: May-2010
– High I/O Intensive Cluster Design: Jul-2010
– High I/O Intensive Cluster Deployment: Nov-2010
Stakeholders (today)
Fermilab Computing Division
Collider Physics:
– CDF Experiment
  – 3 clusters, ~5,500 slots
– CMS T1 Facility
  – 1 cluster, ~8,000 slots
– D0 Experiment
  – 2 clusters, ~5,500 slots
– General Purpose
  – 1 cluster, ~2,000 slots
AstroPhysics:
– Auger
– Dark Energy Survey (DES)
– Joint Dark Energy Mission (JDEM)
Neutrino Program:
– Minos
– MiniBoone
– Minerva
– Long Baseline Neutrino Experiment (LBNE)
– Argoneut
– Mu2e
Others:
– International Linear Collider
– Accelerator Physics Simulations
– Theory
– Open Science Grid VO’s
– Grid Management
Current Architecture
[Diagram. Services: VOMRS Server, VOMS Server, SAZ Server, GUMS Server, Site Wide Gateway, FERMIGRID SE (dcache SRM), Gratia, BlueArc, Squid. Clusters: CMS WC1, CMS WC2, CMS WC3, CDF OSG0, CDF OSG1/2, CDF OSG3/4, D0 CAB1, D0 CAB2, GP Farm. Region labels: Exterior, Interior. Link labels: Periodic Synchronization.]
Step 1 - user registers with VO
Step 2 - user issues voms-proxy-init; user receives voms signed credentials
Step 3 - user submits their grid job via globus-job-run, globus-job-submit, or condor-g
Step 4 - Gateway checks against Site Authorization Service
Step 5 - Gateway requests GUMS Mapping based on VO & Role
Step 6 - Grid job is forwarded to target cluster
Clusters send ClassAds via CEMon to the site wide gateway.
FermiGrid-HA - Why?
The FermiGrid “core” mapping and authorization services (GUMS and/or SAZ) in early 2007 controlled access to:
– Over 2500 systems with more than 12,000 batch slots.
– Petabytes of storage (via gPlazma / GUMS).
An outage of either GUMS or SAZ can cause 5,000 to 50,000 “jobs” to fail for each hour of downtime.
Manual recovery or intervention for these services can have long recovery times (best case 30 minutes, worst case multiple hours).
Automated service recovery scripts can minimize the downtime (and impact to the Grid operations), but can still take several tens of minutes to respond to failures:
– How often the scripts run,
– Scripts can only deal with failures that have known “signatures”,
– Startup time for the service,
– A script cannot fix dead hardware.
FermiGrid-HA - Requirements
Requirements:
– Critical services hosted on multiple systems (n ≥ 2).
– Small number of “dropped” transactions when failover required (ideally 0).
– Support the use of service aliases:
  – VOMS: fermigrid2.fnal.gov -> voms.fnal.gov
  – GUMS: fermigrid3.fnal.gov -> gums.fnal.gov
  – SAZ: fermigrid4.fnal.gov -> saz.fnal.gov
– Implement “HA” services with services that did not include “HA” in their design.
  – Without modification of the underlying service.
Desirables:
– Active-Active service configuration.
– Active-Standby if Active-Active is too difficult to implement.
– A design which can be extended to provide redundant services.
FermiGrid-HA - Actual Component Deployment
[Diagram labels, in slide order]
– Active: fermigrid5, Xen Domain 0
– Active: fermigrid6, Xen Domain 0
– Active: fg5x1, VOMS, Xen VM 1
– Active: fg5x2, GUMS, Xen VM 2
– Active: fg5x3, SAZ, Xen VM 3
– Active: fg5x4, MySQL, Xen VM 4
– Active: fg5x1, LVS, Xen VM 0
– Active: fg5x1, VOMS, Xen VM 1
– Active: fg5x2, GUMS, Xen VM 2
– Active: fg5x3, SAZ, Xen VM 3
– Active: fg5x4, MySQL, Xen VM 4
– Standby: fg5x1, LVS, Xen VM 0
FermiGrid-HA - Production Deployment
We conducted stress tests of the FermiGrid-HA deployment prior to the production release:
– The GUMS-HA service supported > 10,000,000 mapping calls/day.
– The SAZ-HA service supported > 1,000,000 authorization calls/day.
FermiGrid-HA was deployed in production on 03-Dec-2007.
– In order to allow an adiabatic transition for the OSG and our user community, we ran the regular FermiGrid services and FermiGrid-HA services simultaneously for a three-month period (which ended on 29-Feb-2008).
For the first seven months of operation we achieved 99.9969% availability.
We have utilized the HA service redundancy on many occasions:
– Recovery from “wedged” services, virtual machines or Domain-0 hypervisors that required either a service restart, a reboot of the virtual machine or a reboot of the hardware to resolve.
– Multiple software updates.
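To put the 99.9969% availability figure in perspective, the following minimal sketch converts an availability percentage into the downtime it implies. It assumes that "seven months" can be approximated as 7 × 30 days; the function name `implied_downtime` is illustrative and not part of any FermiGrid tooling.

```python
from datetime import timedelta


def implied_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the total downtime implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


# Assumption: "seven months" is approximated as 7 * 30 days.
seven_months = timedelta(days=7 * 30)
downtime = implied_downtime(99.9969, seven_months)
downtime_minutes = downtime.total_seconds() / 60.0
print(f"Implied downtime: {downtime_minutes:.1f} minutes")
```

Under that approximation the implied downtime is on the order of ten minutes over the whole period.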
Service Availability - HA vs Non-HA
Service | Past Week | Past Month | Past Quarter | Past Year
HA services (values as they appear on the slide):
– Hardware: 100.000%, 99.989%
– VOMS: 100.000%, 99.888%
– GUMS: 100.000%
– SAZ: 100.000%, 99.899%
– Squid: 100.000%, 99.955%, 99.863%
– ReSS: 100.000%, 99.987%
----------
Non-HA services:
Gatekeepers | 99.767% | 99.925% | 99.904% | 98.955%
Batch | 99.761% | 99.462% | 99.724% | 99.615%
Gratia | 99.809% | 98.316% | 98.707% | 99.514%
FermiGrid-HA - Conclusions
Virtualization benefits:
+ Significant performance increase,
+ Significant reliability increase,
+ Automatic service failover,
+ Cost savings,
+ Can be scaled as the load and the reliability needs increase,
+ Can perform “live” software upgrades and patches without client impact.
Virtualization drawbacks:
- More complex design,
- More “moving parts”,
- More opportunities for things to fail,
- More items that need to be monitored,
- More “interesting” failure modes.
Preemption
A key factor in allowing opportunistic use of the various FermiGrid clusters is job preemption.
Many codes running on FermiGrid will not work well with checkpoint suspension / resumption.
Instead we configure the FermiGrid clusters to perform a gentle preemption:
– When a cluster “owner” has a job that wants to run in a slot occupied by an opportunistic job, the opportunistic job is given 24 hours to complete:
  – Approximately 65% of all Grid jobs complete in 4-6 hours.
  – Approximately 95% of all Grid jobs complete in less than 24 hours.
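The gentle preemption rule can be sketched as a simple time check. This is an illustration only, not the batch system configuration used on FermiGrid: it assumes the 24 hours are counted from the moment the owner's job requests the slot, and the names `GRACE_PERIOD` and `must_vacate` are hypothetical.

```python
from datetime import datetime, timedelta

# From the slide: an opportunistic job is given 24 hours to complete.
GRACE_PERIOD = timedelta(hours=24)


def must_vacate(owner_requested_at: datetime, now: datetime,
                grace: timedelta = GRACE_PERIOD) -> bool:
    """Return True once the grace period has expired.

    Assumption: the grace period starts when the cluster owner's job
    first asks for the slot held by the opportunistic job.
    """
    return now - owner_requested_at >= grace
```

With this rule an opportunistic job that finishes within the grace period is never interrupted, which is why the job duration statistics matter.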
Grid Job Duration (All VOs)
[Plot: # Jobs vs. Job Duration (hours)]
[Plot: # Jobs vs. Job Duration (hours), with jobs of under 1 hour duration removed]
Central Storage - Today
BlueArc:
– 24+60 TBytes (raw) = 14+48 TBytes (actual).
  – About to purchase 100+ TBytes of disk expansion.
– Default Quotas/VO:
  – Home: 10 GBytes, App: 30 GBytes, Data: 400 GBytes.
– More space can be made available:
  – Can you “loan” us 10 TBytes for the next couple of years…
Dcache:
– Opportunistic (~7 TBytes);
– Resilient;
– Permanent.
Enstore:
– Tape backed storage;
– How much tape are you willing to buy?
Compute vs. I/O Intensive Grid
FermiGrid is architected as a compute intensive Grid:
– High ratio of Compute to I/O;
– Use NFS served volumes to provide “multi-cluster” filesystems (/grid/home, /grid/app, /grid/data);
– This allows jobs to mostly transparently run on the multiple clusters.
Unfortunately, experience has shown us that this architecture is:
– Vulnerable to issues with SAN utilization;
– Not optimal for data intensive analysis.
Storage Response Time
A crude measurement of the “global” storage response time:
– For each gateway system measure the ssh login time:
  – $HOME areas are NFS mounted (/grid/home);
  – time ssh /usr/bin/id
– Sum over the “gateway” systems.
  – Under normal conditions logins are quick (under 5 seconds);
  – What happens when the backend NFS network is under stress?
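The measurement above could be scripted roughly as follows. This is a sketch, not the monitoring script actually used: the function names are illustrative, the caller supplies the gateway host names, and `BatchMode=yes` is an added assumption so that a probe never blocks on a password prompt.

```python
import subprocess
import time
from typing import Callable, Iterable


def ssh_id(host: str) -> None:
    """Log in to the host over ssh and run /usr/bin/id."""
    subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "/usr/bin/id"],
        check=True,
        stdout=subprocess.DEVNULL,
        timeout=300,
    )


def storage_response_time(hosts: Iterable[str],
                          probe: Callable[[str], None] = ssh_id) -> float:
    """Sum the wall-clock login time, in seconds, over all gateway hosts."""
    total = 0.0
    for host in hosts:
        start = time.monotonic()
        try:
            probe(host)
        except (subprocess.SubprocessError, OSError):
            # A failed or timed-out login still counts for the time it took.
            pass
        total += time.monotonic() - start
    return total
```

Because $HOME is NFS mounted, a slow login is a proxy for a slow NFS backend; calling `storage_response_time` periodically with the list of gateway hosts yields the time series plotted on the next slide.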
[Plot: Storage Response Time (seconds)]
I/O Denial of Service Events
– SAN Occupancy: Other systems doing a backup across the SAN with a “broken” FibreChannel interface.
  – Fixed by moving the backup to a system with a working interface.
– Poor choice of NFS attribute caching parameter.
  – Fixed by backing out the change.
– Physics application (root): Doing sparse I/O directly against “random” files in the NFS /grid/data areas.
  – How fast can the actuators and spindles deliver the data?
  – Not Fast Enough!
  – BlueArc then backs up I/O and aggregate I/O throughput plummets.
Simplified Network Topology
[Diagram. Storage: san, ba head 1, ba head 2. Switches: s-s-fcc2-server3, s-s-hub-fcc, s-s-fcc1-server, s-s-fcc2-server, s-cdf-cas-fcc2e, s-cdf-fcc1, s-d0-fcc1w-cas, s-f-grid-fcc1, s-cd-wh8se, s-s-wh8w-6, s-cd-fcc2, s-s-hub-wh, switches for GP worker nodes. Hosts: fermigrid0, fnpc3x1, fnpc4x1, fnpc5x2, fcdf1x1, fcdf2x1, fcdfosg3, fcdfosg4, d0osg1x1, d0osg2x1, fgtest, fgitb-gk. Worker nodes: d0 wn, gp wn, cdf wn.]
Central Storage - Future
We have a project to investigate additional filesystems this year:
– Lustre;
– Hadoop;
– As well as other filesystems.
Why:
– The BlueArc works very well to support compute intensive applications;
– It does not work well when hundreds of copies of an I/O intensive application (e.g. root) attempt to randomly access files from the BlueArc filesystems simultaneously.
Deliverables:
– Recommendations on how to build an I/O intensive analysis cluster.
– Will also look at potential mechanisms to automatically route I/O intensive jobs to this cluster.
Squid
FermiGrid operates a highly available squid service to cache Certificate Authority (CA) Certificate Revocation List (CRL) updates.
We do this because:
– We configure the CRL updates to run every hour on all of our systems;
– We don’t want to be the cause of a denial of service attack against the CA operators.
Since we are not using the full capacity of the squid systems, we allow the experiments to also utilize our squid service.
Squid – User Denial of Service #1
Background:
– User develops a job that requires downloading a 1.2 Gbyte file to perform the analysis.
– User then decides to cache this file through the FermiGrid squid servers.
– The user’s test using the squid server works well for one job, so the user then submits 1,700 simultaneous jobs.
If the squid servers were completely dedicated to servicing these requests, it would require:
( 1702 * 1.2 Gbyte ) / ( 2 * 1 Gbit/second ) = 5.669 hours!
But the squid servers are not dedicated to this transfer!
Solution: Modify the squid server configurations to limit the maximum object size of any object cache request to ~256 Mbytes to prevent a recurrence.
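The transfer time estimate can be reproduced with a few lines of arithmetic. This sketch assumes 8 bits per byte and that both 1 Gbit/second links run at full line rate with no protocol overhead; under those idealized assumptions the result is a lower bound of roughly 2.3 hours. The 5.669 hours quoted on the slide is larger, presumably because it reflects less favourable throughput assumptions that the slide does not state. The function name `transfer_hours` is illustrative.

```python
def transfer_hours(n_jobs: int, file_gbytes: float, n_links: int,
                   link_gbit_per_s: float, bits_per_byte: float = 8.0) -> float:
    """Hours needed to serve one copy of a file to every job.

    Assumes all links are fully dedicated and saturated.
    """
    total_gbit = n_jobs * file_gbytes * bits_per_byte
    seconds = total_gbit / (n_links * link_gbit_per_s)
    return seconds / 3600.0


# Values from the slide: 1702 transfers of 1.2 Gbyte over 2 x 1 Gbit/second.
ideal_hours = transfer_hours(1702, 1.2, 2, 1.0)
print(f"Idealized lower bound: {ideal_hours:.2f} hours")
```

Either way the conclusion is the same: a single large object requested by thousands of simultaneous jobs can monopolize a shared cache for hours.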
Squid – User Denial of Service #2
Background:
– Experiment asks to use the FermiGrid squid servers to cache calibration database accesses.
– Experiment says that their calibration database is served through three (3) database gateway systems that are in a “round robin rotation”.
– First of the database gateway systems fails with a hardware failure.
– Discover that the experiment’s definition of “round robin rotation” is flawed:
  – Try server 1, if access fails then try server 2, if access fails then try server 3.
– Squid server timeout on access was 15 minutes.
– Over 1,000 hung connections waiting for the timeout.
– Blocking other users of the squid service (CRL updates).
Solutions:
– Recommend that the experiment implement a real round robin with heartbeat.
– Modify squid server configuration parameters to shorten timeout intervals.
– Don’t trust experiments’ descriptions of their database access.
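The difference between the flawed scheme and a "real round robin with heartbeat" can be illustrated with a short sketch. It is not the experiment's implementation: the class name, the server names in the usage note, and the `is_alive` callback (standing in for the result of a cheap, frequently refreshed heartbeat) are all hypothetical.

```python
from typing import Callable, Sequence


class HeartbeatRoundRobin:
    """Rotate requests across servers, skipping any that fail the heartbeat."""

    def __init__(self, servers: Sequence[str], is_alive: Callable[[str], bool]):
        self._servers = list(servers)
        self._is_alive = is_alive
        self._next = 0  # index of the server to try first on the next pick

    def pick(self) -> str:
        n = len(self._servers)
        for i in range(n):
            candidate = self._servers[(self._next + i) % n]
            if self._is_alive(candidate):
                # Advance past the chosen server so load keeps rotating.
                self._next = (self._next + i + 1) % n
                return candidate
        raise RuntimeError("no healthy servers available")
```

For example, with servers `gw1`, `gw2`, `gw3` and `gw1` marked dead by the heartbeat, successive picks alternate between `gw2` and `gw3`. In the flawed "try 1, then 2, then 3" scheme every request would first wait for the timeout on the dead `gw1`.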
Gatekeepers / Worker Nodes
Gatekeepers:
– Currently using job home areas served from the central BlueArc:
  – Works well for compute intensive applications;
  – Does not work well for I/O intensive applications;
– Looking at NFSlite to address this;
– Will likely be deployed coincident with Gatekeeper-HA.
Worker Nodes:
– Currently mount an NFS served copy of the OSG worker node (WN) installation;
  – Good for compute intensive applications;
  – Does not work as well for I/O intensive applications;
– Looking at RPM based WN install on each worker node.
Evolution of the Research Program
– Some experiments (CDF and D0) are nearing the end of their active data taking; the Fermilab Tevatron will likely be decommissioned in the next 1-2 years.
– CMS (and ATLAS) at the CERN Large Hadron Collider have just begun data taking.
– Other Fermilab programs are rapidly growing:
  – Neutrino physics (intensity frontier)
  – Astrophysics (cosmic frontier)
Cluster Evolution
Our cluster deployments must evolve to meet the changing needs of the Fermilab experimental program.
Prior to this December, both CDF and D0 had “dedicated” clusters (with opportunistic access to other communities).
In mid-December, the latest “option” nodes purchased for the D0 experiment were NOT connected to the D0 Grid clusters:
– Instead they were connected to the General Purpose (GP) Grid Cluster;
– D0 was given an “allocation” equal to the number of batch slots that these nodes represented in the GP Grid cluster;
– These nodes were immediately used by D0 to catch up on their production analysis processing of the data.
[Plot: D0 Utilization on GP Grid Cluster]
CDF & D0 Clusters - Future
The deployment of the D0 “option” nodes was a significant success.
– D0 was able to catch up data processing before the experiment-imposed conference dataset deadlines.
This will likely continue later in CY2010 with the “option” nodes for the CDF experiment being connected to the GP Grid Cluster.
Looking forward into FY2011, FY2012, …:
– We will commission GP Grid cluster 2;
– I expect that the “base” and “option” nodes for both CDF and D0 will be added to the GP Grid cluster 2;
– The experiments will receive an “allocation” across GP Grid clusters 1 & 2 equal to the number of batch slots that they have contributed.
This process will continue until the “dedicated” CDF and D0 Grid clusters are completely decommissioned.
FermiCloud
Fermilab is investigating Cloud Computing.
We will shortly be in the procurement phase of a modest cloud computing capability.
The goals are:
– Provide virtual systems for software development, system integration and (temporary) production deployment,
– Provide a platform for our storage system evaluations,
– Provide additional “peaking” capacity to our GP Grid Cluster:
  – Idle cloud systems will run a “worker node” Virtual Machine.
– Provide a platform for investigation of receiving virtual machines via Grid job submission mechanisms.
Future
– Develop & Deploy Gatekeeper-HA with NFSlite.
– FermiCloud.
– Design analysis cluster for I/O intensive applications.
– Continue CDF, D0 and GP cluster evolution.
– Automate, automate, automate!
Conclusions
The Fermilab Campus Grid (FermiGrid) is working well and delivering significant value to the Fermilab experimental program.
– We have developed a variety of HA service configurations to ensure the operation of FermiGrid.
– We are embarking on Cloud Computing (FermiCloud).
All of this would not have been possible without:
– The support of Vicky White (FNAL CIO);
– The support of the CD facilities;
– The support of the network infrastructure and network services;
– The dedication of the system support staff.
Delivering an effective Campus Grid is an exercise in Change Management:
– As time progresses everything evolves,
– Planning for the future is an ongoing and significant activity.
Fin
Any Questions?
Extra Slides Follow
FermiGrid-HA - Challenges #1
Active-Standby:
– Easier to implement,
– Can result in “lost” transactions to the backend databases,
– Lost transactions would then result in potential inconsistencies following a failover or unexpected configuration changes due to the “lost” transactions.
  – GUMS Pool Account Mappings.
  – SAZ Whitelist and Blacklist changes.
Active-Active:
– Significantly harder to implement (correctly!).
– Allows a greater “transparency”.
– Reduces the risk of a “lost” transaction, since any transaction which results in a change to the underlying MySQL databases is “immediately” replicated to the other service instance.
– Very low likelihood of inconsistencies.
  – Any service failure is highly correlated in time with the process which performs the change.
FermiGrid-HA - Challenges #2
DNS:
– Initial FermiGrid-HA design called for DNS names each of which would resolve to two (or more) IP numbers.
– If a service instance failed, the surviving service instance could restore operations by “migrating” the IP number for the failed instance to the Ethernet interface of the surviving instance.
– Unfortunately, the tool used to build the DNS configuration for the Fermilab network did not support DNS names resolving to >1 IP numbers.
  – Back to the drawing board.
Linux Virtual Server (LVS):
– Route all IP connections through a system configured as a Linux virtual server.
  – Direct routing
  – Request goes to LVS director, LVS director redirects the packets to the real server, real server replies directly to the client.
– Increases complexity, parts and system count:
  – More chances for things to fail.
– LVS director must be implemented as a HA service.
  – LVS director implemented as an Active-Standby HA service.
– LVS director performs “service pings” every six (6) seconds to verify service availability.
  – Custom script that uses curl for each service.
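The "service ping" idea can be sketched as follows. This is not the custom script used on the LVS director; it is a minimal illustration that shells out to `curl` with standard options and reports which real servers answered. The function names and the shape of the `real_servers` mapping (server name to probe URL) are hypothetical, and a real Grid service would likely also need certificate options that are omitted here.

```python
import subprocess
from typing import Callable, Dict, List

PING_INTERVAL_SECONDS = 6  # from the slide: one service ping every six seconds


def curl_ok(url: str, timeout_seconds: int = 5) -> bool:
    """Return True if curl retrieves the URL successfully within the timeout."""
    try:
        result = subprocess.run(
            ["curl", "--silent", "--fail",
             "--max-time", str(timeout_seconds),
             "--output", "/dev/null", url],
            check=False,
        )
    except OSError:
        # curl is not installed or could not be started.
        return False
    return result.returncode == 0


def healthy_servers(real_servers: Dict[str, str],
                    check: Callable[[str], bool] = curl_ok) -> List[str]:
    """Return the names of the real servers whose probe URL responds."""
    return [name for name, url in real_servers.items() if check(url)]
```

A director loop would call `healthy_servers` every `PING_INTERVAL_SECONDS` and stop routing new connections to any server missing from the result.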
FermiGrid-HA - Challenges #3
MySQL databases underlie all of the FermiGrid-HA Services (VOMS, GUMS, SAZ):
– Fortunately all of these Grid services employ relatively simple database schema,
– Utilize multi-master MySQL replication,
  – Requires MySQL 5.0 (or greater).
  – Databases perform circular replication.
– Currently have two (2) MySQL databases,
  – MySQL 5.0 circular replication has been shown to scale up to ten (10).
  – Failed databases “cut” the circle and the database circle must be “retied”.
– Transactions to either MySQL database are replicated to the other database within 1.1 milliseconds (measured),
– Tables which include auto incrementing column fields are handled with the following MySQL 5.0 configuration entries:
  – auto_increment_offset (1, 2, 3, … n)
  – auto_increment_increment (10, 10, 10, … )
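The effect of the two auto increment settings can be shown with a small simulation. This sketch does not talk to MySQL; it only models the sequence of values each master generates on an empty table, namely `auto_increment_offset`, then `auto_increment_offset + auto_increment_increment`, and so on. The function name `generated_ids` is illustrative.

```python
from itertools import count, islice
from typing import Iterator


def generated_ids(offset: int, increment: int) -> Iterator[int]:
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


# Two masters sharing auto_increment_increment = 10, with offsets 1 and 2.
master1_ids = list(islice(generated_ids(1, 10), 5))  # [1, 11, 21, 31, 41]
master2_ids = list(islice(generated_ids(2, 10), 5))  # [2, 12, 22, 32, 42]

# The two sequences never overlap, so concurrent inserts cannot collide.
assert set(master1_ids).isdisjoint(master2_ids)
```

Choosing an increment of 10 with offsets 1 through n leaves room for up to ten masters in the replication circle without changing the existing key ranges.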
FermiGrid-HA – Technology (Dec-2007)
Xen:
– SL 5.0 + Xen 3.1.0 (from xensource community version)
  – 64 bit Xen Domain 0 host, 32 and 64 bit Xen VMs
– Paravirtualisation (best performance).
Linux Virtual Server (LVS 1.38):
– Shipped with Piranha V0.8.4 from Red Hat.
Grid Middleware:
– Virtual Data Toolkit (VDT 1.8.1)
– VOMS V1.7.20, GUMS V1.2.10, SAZ V1.9.2
MySQL:
– MySQL V5 with multi-master database replication.
FermiGrid-HA - Host Configuration
The fermigrid5&6 Xen hosts are Dell 2950 systems.
Each of the Dell 2950s is configured with:
– Two 3.0 GHz core 2 duo processors (total 4 cores).
– 16 Gbytes of RAM (recently upgraded to 24 Gbytes).
– Raid-1 system disks (2 x 300 Gbytes, 15K RPM, SAS).
– Raid-1 non-system disks (2 x 300 Gbytes, 15K RPM, SAS) (recently 2->4).
– Dual 1 Gig-E interfaces:
  – 1 connected to public network,
  – 1 connected to private network.
System Software Configuration:
– Each Domain 0 system is configured with 5 Xen VMs.
  – Previously we had 4 Xen VMs.
– Each Xen VM is dedicated to running a specific service:
  – LVS Director, VOMS, GUMS, SAZ, MySQL
  – Previously we were running the LVS director in the Domain-0.
FermiGrid-HA - Performance
Stress tests of the FermiGrid-HA GUMS deployment:
– A stress test demonstrated that this configuration can support ~9.7M mappings/day.
  – The load on the GUMS VMs during this stress test was ~9.5 and the CPU idle time was 15%.
  – The load on the backend MySQL database VM during this stress test was under 1 and the CPU idle time was 92%.
Stress tests of the FermiGrid-HA SAZ deployment:
– The SAZ stress test demonstrated that this configuration can support ~1.1M authorizations/day.
  – The load on the SAZ VMs during this stress test was ~12 and the CPU idle time was 0%.
  – The load on the backend MySQL database VM during this stress test was under 1 and the CPU idle time was 98%.
Stress tests of the combined FermiGrid-HA GUMS and SAZ deployment:
– Using a GUMS:SAZ call ratio of ~7:1
– The combined GUMS-SAZ stress test that was performed on 06-Nov-2007 demonstrated that this configuration can support ~6.5M GUMS mappings/day and ~900K authorizations/day.
  – The load on the SAZ VMs during this stress test was ~12 and the CPU idle time was 0%.
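The daily figures are easier to compare with per-request latencies when expressed as sustained rates. The sketch below performs that conversion and checks the combined test against the stated call ratio. It reads the combined GUMS figure as about 6.5 million mappings per day, which is an assumption chosen because it is consistent with the ~7:1 GUMS:SAZ ratio; the function name `per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(calls_per_day: float) -> float:
    """Convert a daily call count into an average sustained rate per second."""
    return calls_per_day / SECONDS_PER_DAY


gums_rate = per_second(9.7e6)  # standalone GUMS test
saz_rate = per_second(1.1e6)   # standalone SAZ test

# Combined test; 6.5e6 mappings/day is an assumed reading of the slide.
combined_ratio = 6.5e6 / 900e3

print(f"GUMS: {gums_rate:.0f}/s, SAZ: {saz_rate:.1f}/s, ratio: {combined_ratio:.1f}:1")
```

The standalone tests correspond to roughly a hundred GUMS mappings and a little over a dozen SAZ authorizations per second, sustained for a full day.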