Presentation by David Britton, 28/May/09. Presentation transcript:
2 The Large Hadron Collider at CERN. 14 TeV collisions; 27 km circumference; 1200 14 m, 8.36 Tesla superconducting (SC) dipoles; 8000 cryomagnets; 40,000 tons of metal at -271 °C; 700,000 L liquid He; 12,000,000 L liquid N2; 800,000,000 proton-proton collisions/sec.
3 Data from the LHC Experiments. Experiments: ATLAS (7,000 tonnes), CMS (12,500 tonnes), ALICE (10,000 tonnes), LHCb (5,600 tonnes). Detector readout figures shown: 100 million channels, raw data = 320 MB/s; 55 million channels, raw data = 220 MB/s; 18 million channels, raw data = 100 MB/s; 1.2 million channels, raw data = 50 MB/s. Raw data flow ~700 MB/s in total; ~15 PB of data per year. One year's data from the LHC would fill a stack of CDs 20 km high (for scale: Concorde 15 km, Mt. Blanc 4.8 km).
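The figures on this slide can be cross-checked with a little arithmetic. The sketch below sums the four quoted raw data rates and estimates the height of a stack of CDs holding one year's data. The CD capacity (700 MB) and thickness (1.2 mm) are assumptions that the slide does not state, so the result only confirms the order of magnitude.

```python
# Back-of-envelope check of the slide's data-volume figures.
# CD capacity and thickness are assumed values, not taken from the slide.

RAW_RATES_MB_S = [320, 220, 100, 50]  # raw data rates quoted on the slide


def total_rate_mb_s(rates=RAW_RATES_MB_S):
    """Sum the quoted raw data rates (MB/s)."""
    return sum(rates)


def cd_stack_height_km(volume_pb, cd_capacity_mb=700.0, cd_thickness_mm=1.2):
    """Height in km of a stack of CDs holding `volume_pb` petabytes."""
    n_cds = volume_pb * 1e15 / (cd_capacity_mb * 1e6)  # bytes / bytes per CD
    return n_cds * cd_thickness_mm / 1e6  # mm -> km


if __name__ == "__main__":
    print(total_rate_mb_s())         # compare with the quoted ~700 MB/s
    print(cd_stack_height_km(15.0))  # compare with the quoted ~20 km
```

With these assumed CD dimensions the rates sum to 690 MB/s and the stack comes out at roughly 26 km, the same order of magnitude as the 20 km quoted.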
4 Data Driven Grid Computing. Grid architecture chosen because: the costs of maintaining and updating resources are more easily shared in a distributed environment; funding bodies can provide local resources and still contribute to the global goal; it is easier to build in redundancy and fault tolerance and to minimise the risk from a single point of failure; the LHC will operate around the clock for 8 months each year, and spanning time zones means that monitoring and support are more readily provided. Experiments: ALICE, ATLAS, CMS, LHCb.
5 Worldwide LHC Computing Grid. Online system; Offline farm. Tier 0: CERN computer centre. Tier 1: national centres (RAL, UK; France; Italy; Germany; Spain); 11 T1 centres. Tier 2: regional groups (ScotGrid, NorthGrid, SouthGrid, London). Institutes (Glasgow, Edinburgh, Durham). Workstations. Functions labelled on the diagram: Primary Data Store; Reconstruction, Storage, Analysis; Simulation, Analysis.
7 How does it work? Components. Tier 0, Tier 1, Tier 2. Services: data movement, File Transfer Service (FTS); storage interface, Storage Resource Manager (SRM); authorisation/roles, Virtual Organisation Membership Service (VOMS); metadata/replication, LCG File Catalogue (LFC); batch submission, Workload Management System (WMS); distributed conditions databases, Oracle Streams (3D). Layers shown: Grid interfaces (e.g. Ganga); production/analysis systems; Grid middleware; experiment frameworks; WLCG fabric.
8 How does it work? Workflow. [Diagram. Components: Submitter, gridui, JDL, VOMS, WLMS, JS, RB, LFC, BDII, Logging & Bookkeeping, and four sets of Grid Enabled Resources (CPU nodes and storage). Steps are numbered 0 to 11; the labelled ones are: 0 VOMS-proxy-init; 1 Job Submission; 2 Job Status?; 11 Job Retrieval.]
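As an illustration of the labelled steps (0 VOMS-proxy-init, 1 Job Submission, 2 Job Status?, 11 Job Retrieval), the sketch below renders a minimal JDL job description and lists commands that could correspond to those steps. It assumes the gLite WMS command-line tools (`glite-wms-job-submit`, `glite-wms-job-status`, `glite-wms-job-output`), which the slide does not name; the VO name, file names and job identifier are hypothetical placeholders. The commands are only built as strings and are not executed.

```python
# Sketch only: builds a JDL text and the command lines for the labelled steps.
# Tool names assume the gLite WMS CLI; VO, paths and job id are placeholders.


def render_jdl(executable, arguments="", stdout="std.out", stderr="std.err"):
    """Render a minimal JDL job description as text."""
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        f'StdOutput = "{stdout}";',
        f'StdError = "{stderr}";',
        # Files to bring back when the job output is retrieved.
        f'OutputSandbox = {{"{stdout}", "{stderr}"}};',
    ]
    return "\n".join(lines) + "\n"


def workflow_commands(vo, jdl_path, job_id):
    """Return the command lines for steps 0, 1, 2 and 11 of the diagram."""
    return [
        f"voms-proxy-init --voms {vo}",         # step 0: obtain a VOMS proxy
        f"glite-wms-job-submit -a {jdl_path}",  # step 1: job submission
        f"glite-wms-job-status {job_id}",       # step 2: job status
        f"glite-wms-job-output {job_id}",       # step 11: job retrieval
    ]


if __name__ == "__main__":
    print(render_jdl("/bin/hostname"))
    for command in workflow_commands("atlas", "job.jdl", "JOB_ID"):
        print(command)
```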
9 Availability: The UK Tier-1. Availability is the fraction of time the site is up (so even scheduled maintenance counts against this metric). The target is 97% (achieved), measured by SAM (Service Availability Monitor) tests. There are also experiment-specific SAM tests, which are more demanding; the example shown here is from ATLAS, where the target is also 97%. Performance is improving but was degraded by the CASTOR mass storage system.
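To make the metric concrete, the sketch below computes availability as up-time divided by total time and converts a target into a downtime budget. The 30-day period used in the example is an assumption chosen for illustration, not a figure from the slide.

```python
# Availability as defined on the slide: fraction of time the site is up.
# Scheduled maintenance counts as downtime under this definition.


def availability(up_hours, total_hours):
    """Fraction of the period during which the site was up."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return up_hours / total_hours


def downtime_budget_hours(target, period_hours):
    """Hours of downtime (scheduled or not) allowed while meeting `target`."""
    return (1.0 - target) * period_hours


if __name__ == "__main__":
    month_hours = 30 * 24  # assumed 30-day reporting period
    print(downtime_budget_hours(0.97, month_hours))
```

At a 97% target, a 30-day period leaves about 21.6 hours for all downtime, scheduled maintenance included.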
11 Resilience and Disaster Planning. The Grid must be made resilient to failures and disasters across a wide range of scales, from simple disk failures up to major incidents like the prolonged loss of a whole site. One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault tolerance a reality.
12 Strategy. Fortifying the Service: duplicate services or machines; increase the hardware's capacity (to handle faults); use (good) fault detection; implement automatic restarts; provide fast intervention; fully investigate failures; report bugs and ask for better middleware. Disaster Planning: taking control early enough; (pre-)establishing possible options; understanding user priorities; timely action; effective communication. Hardware; Software; Location.
13 Duplicating Services or Machines. Multiple WMSes.
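One way a client might exploit duplicated services is to try each instance in turn until one accepts the request. The sketch below shows that pattern under stated assumptions: the endpoint names and the `submit` callable are hypothetical stand-ins, not part of any real WMS client.

```python
# Sketch of client-side failover across duplicated service instances.
# `submit` is a hypothetical callable that raises an exception on failure.


def submit_with_failover(endpoints, submit):
    """Try each endpoint in order; return (endpoint, result) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, submit(endpoint)
        except Exception as exc:  # record the failure and move on
            errors.append((endpoint, exc))
    raise RuntimeError(f"all {len(endpoints)} endpoints failed: {errors}")


if __name__ == "__main__":
    def fake_submit(endpoint):
        # Simulate the first instance being down.
        if endpoint == "wms01.example.org":
            raise ConnectionError("unreachable")
        return "job-accepted"

    print(submit_with_failover(["wms01.example.org", "wms02.example.org"], fake_submit))
```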
14 Hardware Capacity and Fault Tolerance. Examples: Storage: use RAID arrays (RAID5 or RAID6 for storage arrays; RAID1 for system disks); hot spares allow automatic rebuilds. Memory: increase memory capacity; use ECC (error-correcting code) memory and monitor for a rise in the error correction rate. Power: use redundant power supplies connected to different circuits if possible, and UPS for critical systems. Interconnects: use two or more bonded network connections with cables routed separately. CPU: use more powerful machines. Databases: use Oracle RACs (Real Application Clusters), which enable multiple servers to access the database simultaneously. Resilient hardware will help services survive common failure modes and keep them operating until you can replace the component and make the service resilient again.
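The trade-off between capacity and fault tolerance for the RAID levels mentioned can be sketched as follows. The function uses the standard definitions (RAID1 mirrors, RAID5 spends one disk's worth of space on parity, RAID6 spends two); the disk counts and sizes in the example are hypothetical.

```python
# Usable capacity and tolerated disk failures for RAID1, RAID5 and RAID6.
# Array sizes in the example are illustrative, not taken from the slide.


def raid_usable(level, n_disks, disk_tb):
    """Return (usable_tb, disk_failures_tolerated) for one array."""
    if level == "RAID1":
        if n_disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return disk_tb, n_disks - 1  # every disk holds a full copy
    if level == "RAID5":
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (n_disks - 1) * disk_tb, 1  # one disk's worth of parity
    if level == "RAID6":
        if n_disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return (n_disks - 2) * disk_tb, 2  # two disks' worth of parity
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    for level, n in [("RAID1", 2), ("RAID5", 8), ("RAID6", 8)]:
        print(level, raid_usable(level, n, 1.0))
```

RAID6 gives up one more disk of capacity than RAID5 in exchange for surviving a second failure, which matters when a second drive fails during a rebuild.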
15 Fault Detection. If it can be monitored, monitor it! Catch problems early, e.g. with Nagios alarms: load alarms; file systems nearly full; certificates close to expiry; failed drives. Look for signatures of impending problems to predict component failure. Idle disks hide their faults: run regular low-level verification to push sick drives over the edge, and replace them early in the failure cycle so that they do not fail during a rebuild. Increased error rates on network links can come from failing line cards, transceivers or cable/fibre degradation; if you have redundant links, you can replace the faulty one and keep the service going. Use a call-out system for problems that impact services.
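Two of the alarms listed (file systems nearly full, certificates close to expiry) can be sketched as simple threshold checks. The return values follow the Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL); the thresholds themselves are assumptions chosen for illustration, not values from the slide.

```python
# Threshold checks in the style of Nagios plugins.
# Warning/critical thresholds below are illustrative assumptions.
from datetime import datetime, timedelta

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit-code convention


def check_filesystem(used_fraction, warn=0.85, crit=0.95):
    """Alarm when a file system is nearly full."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_certificate(expires, now, warn_days=30, crit_days=7):
    """Alarm when a certificate is close to expiry."""
    remaining = expires - now
    if remaining <= timedelta(days=crit_days):
        return CRITICAL
    if remaining <= timedelta(days=warn_days):
        return WARNING
    return OK


if __name__ == "__main__":
    now = datetime(2009, 5, 28)
    print(check_filesystem(0.90))
    print(check_certificate(now + timedelta(days=3), now))
```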
16 Intervention and Investigation. Run a 24x7 call-out system connected to a pager and triggered by automatic alarms, with a 2-hour response time for critical failures. All incidents are examined to learn lessons: the call-out rate has dropped from 10/day to as low as 1/week. Serious incidents are written up in reports (sent to the WLCG so that other sites around the world can see them).
18 Disaster Planning. A Disaster Response plan is needed that is well understood: use it regularly for anything that could turn into a disaster! Stage 1, Disaster Potential Identified: informally assess; monitor; set deadlines; do not interfere. Stage 2, Possible Disaster: add internal management oversight; formally assess; divert resources. Stage 3, Disaster Likely: add external experts and stakeholder representation to oversight; hold regular meetings with the experiments; prepare contingencies; communicate widely. Stage 4, Actual Disaster: manage the disaster according to the high-level disaster plan and the contingencies identified at Stage 3; communicate widely.
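The four stages can be read as an escalation ladder with a set of actions attached to each rung. The sketch below encodes them that way; the action texts paraphrase the slide, while the `NONE` stage and the function names are assumptions added for illustration.

```python
# Escalation ladder for the four disaster stages described on the slide.
# Stage.NONE and the helper names are illustrative additions.
from enum import IntEnum


class Stage(IntEnum):
    NONE = 0                  # assumed baseline: no disaster potential identified
    POTENTIAL_IDENTIFIED = 1
    POSSIBLE = 2
    LIKELY = 3
    ACTUAL = 4


ACTIONS = {
    Stage.POTENTIAL_IDENTIFIED: ["Informally assess", "Monitor", "Set deadlines",
                                 "Do not interfere"],
    Stage.POSSIBLE: ["Add internal management oversight", "Formally assess",
                     "Divert resources"],
    Stage.LIKELY: ["Add external experts and stakeholders to oversight",
                   "Regular meetings with the experiments",
                   "Prepare contingencies", "Communicate widely"],
    Stage.ACTUAL: ["Manage according to the disaster plan and Stage 3 contingencies",
                   "Communicate widely"],
}


def escalate(stage):
    """Move one stage up the ladder, stopping at ACTUAL."""
    return Stage(min(stage + 1, Stage.ACTUAL))


def actions_for(stage):
    """Actions attached to a stage (empty for NONE)."""
    return ACTIONS.get(stage, [])


if __name__ == "__main__":
    stage = escalate(Stage.POSSIBLE)
    print(stage.name, actions_for(stage))
```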
19 Summary. In the UK we have spent the last 6 years preparing for the LHC data challenge and have deployed 20,000 CPUs as part of a worldwide Grid of 180,000 CPUs, the largest scientific computing Grid in the world. The last year has focused on making the service reliable and resilient: our Tier-1 centre currently delivers 97% availability and our Tier-2 centres average over 90%. We have initiated planning to understand the possible responses to a major disaster and to set up a disaster management process to handle such incidents. We look forward to the arrival of LHC data!