
1 RAL Site Report
HEPiX Fall 2014, Lincoln, Nebraska, 13-17 October 2014
Martin Bly, STFC-RAL


3 Tier1 Hardware
CPU: ~127k HS06 (~13k cores)
Storage: ~13PB disk
Tape: 10k-slot SL8500 (one of two in the system)
FY14/15 procurement
– Tenders 'in flight', closing 17th October
– Expect to procure 6PB and 42k HS06
– Depends on price…
New this time:
– Storage capable of both Castor and Ceph, with extra SSDs for Ceph journals
– 10GbE for WNs

4 Networking
Tier1 LAN
– Mesh network transfer progressing slowly
– Phase 1 of new Tier1 connectivity enabled
– Phase 2: move the firewall bypass and OPN links to the new router; will provide a 40Gb/s pipe to the border
– Phase 3: 40Gb/s redundant link from the Tier1 to the RAL site
RAL LAN
– Migration to new firewalls completed
– Migration to new core switching infrastructure almost complete
– Sandboxed IPv6 test network available
Site WAN
– No changes

5 Network Weathermap

6 Virtualisation
Issues with VMs
– Had two production clusters with shared storage, plus several local-storage hypervisors
– Windows Server 2008 + Hyper-V
– Stability and migration problems on the shared-storage systems
– Migrated all services to the local-storage clusters
New HV clusters
– New configuration of networking and hardware
– Windows Server 2012 and Hyper-V
– Three production clusters, including additional hardware with more RAM
– Tiered storage on the primary clusters

7 CASTOR / Storage
Castor
– June: upgrade to the new major version (2.1.14) with various improvements (disk rebalancing, xroot internal protocol); upgrade complete
– New logging system with ElasticSearch
– Draining disk servers is still slow: a major production problem
Ceph
– Evaluations continue on the small test cluster; SSDs for journals installed in the cluster nodes
– Testing shows mixed performance results; needs more study
– Large departmental resource: 30 servers, ~1PB total (Dell R520, 8 x 4TB SATA HDD, 32GB RAM, 2 x E5-2403v2, 2 x 10GbE)
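As an aside on the SSD journals: in pre-BlueStore Ceph of this era, each OSD's journal could be placed on a separate SSD partition when the OSD was prepared. A minimal sketch of that pattern, driving the era's ceph-disk tool from Python; the device names and the HDD-to-SSD mapping are hypothetical, not RAL's actual layout:

    import subprocess

    # Hypothetical mapping of data HDDs to SSD partitions used as OSD journals.
    hdd_to_ssd_journal = {
        "/dev/sdb": "/dev/sdk1",
        "/dev/sdc": "/dev/sdk2",
    }

    for data_dev, journal_dev in hdd_to_ssd_journal.items():
        # 'ceph-disk prepare <data> <journal>' lays out the OSD with its
        # journal on the given (SSD) partition; 'activate' then brings the
        # OSD into the cluster using the first data partition.
        subprocess.check_call(["ceph-disk", "prepare", data_dev, journal_dev])
        subprocess.check_call(["ceph-disk", "activate", data_dev + "1"])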

8 Storage failover
What if the local storage is unavailable? What if someone else's local storage is unavailable?
Xrootd allows remote access to data resources on demand if the data is not available locally.
At RAL, bulk data traffic bypasses the firewall
– To/from the OPN and SJ6, for disk servers only
– NOT for WNs
What happens at the firewall?
– Concern for non-Tier1 traffic if we have a failover
Tested with assistance from CMS
– Firewall barely notices: very small setup, then the transfer is offloaded to ASICs
– Larger test to come
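To illustrate the failover pattern (a sketch, not RAL's actual configuration): a job can try a site-local XRootD endpoint first and only fall back to a remote redirector when the local read fails. A minimal Python example around the standard xrdcp client; the hostnames and file path are hypothetical:

    import subprocess

    # Hypothetical endpoints: a site-local XRootD redirector, and a remote
    # federation redirector used only when the local copy is unavailable.
    LOCAL_URL  = "root://xrootd-local.example.ac.uk//store/data/example.root"
    REMOTE_URL = "root://global-redirector.example.org//store/data/example.root"

    def fetch(dest="example.root"):
        """Copy the file, preferring the local source and falling back to remote."""
        for src in (LOCAL_URL, REMOTE_URL):
            # xrdcp exits 0 on success; -f overwrites any partial local copy
            if subprocess.call(["xrdcp", "-f", src, dest]) == 0:
                return src
        raise RuntimeError("file unavailable from both local and remote sources")

    if __name__ == "__main__":
        print("fetched from", fetch())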

9 JASMIN/CEMS (hardware, Sept 2014)
The JASMIN super-data-cluster serves:
– The UK and worldwide climate and weather modelling community
– Climate and Environmental Monitoring from Space (CEMS)
– …and all of NERC environmental sciences since JASMIN2, e.g. environmental genomics, mud slides, etc.
– Facilitating further comparison and evaluation of models with data
12PB Panasas storage at STFC
– Fast parallel IO to physical and VM servers
– Largest-capacity Panasas installation in the world (204 shelves)
– Arguably one of the top ten IO systems in the world (~250 GByte/sec)
Virtualised and physical compute (~3500 cores)
– Physical batch compute "LOTUS"
– User- and admin-provisioned cloud of virtual machines
Data transfer over private network links to UK and worldwide sites

10 2014-15 JASMIN2
– Expanded from 5.5PB to 12PB of high-performance disc and added ~3,000 CPU cores + ~5PB tape
– Largest single-site Panasas deployment in the world; benchmarks suggest this might be in the top ten IO systems in the world
– Includes a large (100 servers + 400TB NetApp VMDK storage) VMware vCloud Director cloud deployment with a custom user portal
– 1200+ non-blocking, zero-congestion 10GbE ports; L3 ECMP/OSPF low-latency (7.5 ms MPI) interconnect
  One converged network for everything
  Implementing VXLAN (L2 over L3) technology for the cloud; see the sketch after this slide
– Same SuperMicro servers used for batch/MPI computing and for cloud/hypervisor work
  Mellanox ConnectX-3 Pro NICs provide low latency for MPI and VXLAN offload for the cloud
  Servers are all 16-core Ivy Bridge with 128GByte RAM (some at 512GB); all 10Gb networking
– JASMIN3 this year will add mostly 2TByte-RAM servers and several PB of storage
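As a sketch of the VXLAN (L2 over L3) idea: on Linux an overlay segment is commonly created as a VXLAN interface bound to a VXLAN network identifier and a multicast group on the routed underlay, then bridged to the VMs. A minimal example driving the standard iproute2 commands from Python; the interface name, VNI, multicast group and underlay device are hypothetical:

    import subprocess

    # Minimal sketch of an L2-over-L3 overlay with VXLAN on Linux (iproute2).
    VNI = 100                 # hypothetical VXLAN network identifier
    UNDERLAY_DEV = "eth0"     # routed (L3 ECMP) underlay interface
    MCAST_GROUP = "239.1.1.1" # multicast group for BUM traffic

    cmds = [
        # Create the VXLAN interface on top of the routed underlay
        ["ip", "link", "add", "vxlan100", "type", "vxlan",
         "id", str(VNI), "group", MCAST_GROUP,
         "dev", UNDERLAY_DEV, "dstport", "4789"],
        # Bring it up; it can then be attached to a bridge with VM tap devices
        ["ip", "link", "set", "vxlan100", "up"],
    ]

    for cmd in cmds:
        subprocess.check_call(cmd)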

11 Other stuff
Shellshock
– Patched exposed systems quickly; bulk done within days
– Long tail of systems to chase down
Electrical 'shutdown' for circuit testing
– Scheduled for January; phased circuit testing
– Tier1 will continue to operate, possibly with some reduced capacity
Windows XP (still) banned from site networks
New telephone system rollout complete
Recruited a grid admin, who starts soon
Recruiting a system admin and a hardware technician soon
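For reference, the usual quick check for the original Shellshock bug (CVE-2014-6271) is whether bash executes code smuggled in through an exported function definition; a patched bash prints only "test". A small Python wrapper around that well-known one-liner:

    import subprocess

    # Well-known check for the original Shellshock bug (CVE-2014-6271):
    # a vulnerable bash executes the trailing command in the exported
    # function definition and prints "vulnerable"; a patched bash does not.
    env = {"PATH": "/usr/bin:/bin", "x": "() { :;}; echo vulnerable"}
    result = subprocess.run(["bash", "-c", "echo test"],
                            env=env, capture_output=True, text=True)
    print("VULNERABLE" if "vulnerable" in result.stdout else "patched (or not exposed)")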

12 Questions?

