ATLAS Software Installation redundancy
Alessandro De Salvo <Alessandro.DeSalvo@roma1.infn.it>
08-11-2011

Outline
- System hosted in Rome
- Redundancy of the Installation System and the other services
- Current situation and plans

A. De Salvo – Oct 08 2011
The power of nature
- On Oct 20, 2011 Rome was flooded by an unexpected amount of rain: 127 mm in about 3 hours.
- The site INFN-ROMA1 had to be switched off after the water reached the servers. As you might know, computers cannot swim easily!
- 100000 tons of water in the computing room, pumped out in about 12 hours.
Services hosted in Rome
- Installation System: two databases (rw, ro) and installation agents (EGEE, OSG, CVMFS). Redundant services, but all hosted at the same site.
- Global KitValidation (GKV) portal and main KV cache; the KV cache is mirrored at CERN.
- Installation tools cache, hosted with the KV cache.
- Release validation portal.
All of the above services stopped working on Oct 20 and were resumed 5 days later.
Temporary solutions
- A toy installation system (LJSFlite), re-written from scratch in ~8 hours.
- 3 analysis caches, 1 base release and 1 patch deployed with LJSFlite while the main system was down: > 500 validations.
- Compatible with the main system.
- Using KV from the CERN mirror (no GKV).
Missing services:
- GKV
- Release DB
- Main installation system
- Installation tools (compilers, global patches)
Full redundancy solutions (in progress)
- The installation system already supports native redundancy: multiple agents, which can be located at different sites.
- DB replicas: 1 rw replica, multiple ro replicas.
- Logfile access facility via GlusterFS geo-replication.
- Experimenting with a WAN automatic failover system:
  - Ring replication between N DB replicas (multi-master): 1 rw replica, 3 ro replicas.
  - Main rw replica and 1 ro replica in Roma, 1 ro replica in Napoli: https://atlas-install.na.infn.it/atlas_install
  - Ready to test the automatic ro -> rw switching for the active replica, via a watchdog.
  - Testing the global failover domain, pointing to the active replicas, using the INFN HA DNS: https://atlas-install.ha.infn.it
[Diagram: one rw DB replicating to the ro DB replicas]
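The watchdog-driven ro -> rw switch described above can be sketched as follows. This is a minimal illustration, not the actual ATLAS installation-system code: the replica hostnames and the health probe are hypothetical assumptions.

```python
# Hypothetical watchdog helper: walk the replica ring in order and
# promote the first healthy replica to read-write (rw). Hostnames and
# the probe are illustrative assumptions.

def pick_active_rw(replicas, is_healthy):
    """Return the first healthy replica in ring order, or None.

    replicas   -- ordered list of replica hostnames (preferred rw first)
    is_healthy -- callable hostname -> bool (e.g. a TCP/HTTP probe)
    """
    for host in replicas:
        if is_healthy(host):
            return host
    return None  # no replica reachable

# Example: the Roma rw replica is down, so the Napoli replica is promoted.
ring = ["atlas-install.roma1.infn.it", "atlas-install.na.infn.it"]
down = {"atlas-install.roma1.infn.it"}
print(pick_active_rw(ring, lambda h: h not in down))
# -> atlas-install.na.infn.it
```

A real watchdog would run this periodically and, on a change of the active replica, update the HA DNS entry and flip the promoted replica's DB to read-write mode.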
Full redundancy solutions [2]
- GKV and the release databases can be hosted in the same Installation System replicas:
  - The Release DB is already hosted in the main Installation System DB space.
  - GKV can be added.
- The installation tools will be mirrored at CERN: simple synchronization.
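The "simple synchronization" of the installation tools could be a one-way rsync mirror. A hedged sketch follows; the paths and the CERN host are illustrative assumptions, since the slides do not describe the actual mirror job.

```python
# Hypothetical sketch of a one-way mirror of the installation tools
# cache to CERN. Paths and the remote host are illustrative assumptions.
import subprocess


def mirror_cmd(src, dest):
    """Build an rsync command that makes dest an exact copy of src."""
    return ["rsync", "-a", "--delete", src, dest]


cmd = mirror_cmd("/data/atlas-install/tools/",
                 "mirror.cern.ch:/data/atlas-install/tools/")
# A periodic job (e.g. cron) would then execute it:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

With `--delete`, files removed from the master are also removed from the mirror, keeping the two copies identical.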
Current situation and plans
- Installation System replica
  - Main DB instance in Roma, working backup in Napoli; at least a third replica will be added at CERN.
  - Every replica is fully functional: it will use the local replica to show the ro status and the current rw replica for the actions.
  - You can now access the installation system via the HA domain (experimental): https://atlas-install.ha.infn.it
- GKV replica
  - Can be added easily to the Installation System replicas once the filesystem geo-replication is in place.
  - Testing the geo-replication now; it needs an upgrade of the main DB machine, to be done by the end of this month.
- KV & Installation Tools
  - Partially done, to be completed in the next few days.
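The per-replica behaviour described above (ro status served locally, actions sent to the current rw replica) amounts to simple query routing. A minimal sketch, with hypothetical endpoint names that stand in for the real Roma and Napoli DB instances:

```python
# Hypothetical read/write routing inside a fully functional replica:
# reads are answered by the local (possibly ro) DB, writes are forwarded
# to the current rw replica. Endpoint names are illustrative assumptions.

LOCAL_DB = "db.na.infn.it"       # this site's local replica (ro)
CURRENT_RW = "db.roma1.infn.it"  # active rw replica (changes on failover)


def endpoint_for(operation):
    """Route an operation: 'write' goes to the rw replica, else local."""
    return CURRENT_RW if operation == "write" else LOCAL_DB


print(endpoint_for("read"), endpoint_for("write"))
# -> db.na.infn.it db.roma1.infn.it
```

On failover, only `CURRENT_RW` needs updating (by the watchdog / HA DNS), while read traffic keeps hitting the local replica.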