Issues in Milan Two main problems (details in the next slides): – Site excluded from analysis due to corrupted installation of some releases (mainly 16.6.7)


Issues in Milan

Two main problems (details in the next slides):
– Site excluded from analysis due to corrupted installation of some releases (mainly 16.6.7)
    – This was the real showstopper
    – Several time-consuming attempts to clean up and reinstall
    – Reinstallation apparently successful, but the release was corrupted again after an hour or so
– StoRM silently stopped processing requests
    – The underlying GPFS file system halted in an apparent deadlock, but the storage areas were still correctly mounted, so no alarm was triggered
Unfortunate timing: the two problems occurred at the same time, during the summer holidays (reduced manpower)
– Other problems, not directly related (air conditioning of the computing room, server h/w failures), required attention, further reducing the available manpower
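The StoRM failure went unnoticed because a "is it mounted?" check stays green while the file system is hung. A minimal sketch of an active probe with a timeout is given here, assuming a monitoring script written in Python; the function name `probe_storage` and the default timeout are illustrative and do not come from the slides.

```python
import os
import threading


def probe_storage(path, timeout_s=10.0):
    """Return True only if os.stat(path) succeeds within timeout_s seconds.

    Hypothetical helper: a mount check passes on a hung file system,
    whereas a real metadata operation would block and hit the timeout.
    """
    result = {"ok": False}

    def _stat():
        try:
            os.stat(path)          # blocks if the file system is hung
            result["ok"] = True
        except OSError:
            result["ok"] = False   # path missing or I/O error

    worker = threading.Thread(target=_stat, daemon=True)
    worker.start()
    worker.join(timeout_s)         # give up waiting after timeout_s
    return (not worker.is_alive()) and result["ok"]


if __name__ == "__main__":
    # "." stands in for a storage area path, which the slides do not give.
    print(probe_storage(".", timeout_s=5.0))
```

The probe runs in a helper thread so that the caller gets an answer on time; the stuck thread itself may linger until the file system recovers.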

Release installation issue (solved)

In Milan, the worker nodes (WNs) are split between two rooms, each on a different subnet, with a single NFS server providing the s/w area to both rooms through two different network adapters
The different NFS network names confused the s/w installation system, generating a race condition between installation jobs running on the two WN subsets
Fully understood and solved (by including all WNs in a common subnet) only after three weeks
– It was not a particularly difficult problem, but efforts were focused on the other, storage-related issue
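The slides do not say how the installation system was confused. One plausible mechanism, shown purely as an illustration, is a lock that is keyed on the NFS server name as seen by each room: the same physical s/w area then gets two different locks, and two installation jobs can run at once. The host names and the `lock_key` function below are hypothetical.

```python
def lock_key(nfs_source):
    """Hypothetical: build an install-lock name from 'host:/export'."""
    host, _, export = nfs_source.partition(":")
    return f"install-lock::{host}{export}"


# Same server and same export, reached under two names (assumed names).
room_a = lock_key("nfs-room-a.example.org:/sw")
room_b = lock_key("nfs-room-b.example.org:/sw")
print(room_a == room_b)   # False: each room takes its own lock

# With all WNs in a common subnet there is a single name, hence one lock.
common_1 = lock_key("nfs.example.org:/sw")
common_2 = lock_key("nfs.example.org:/sw")
print(common_1 == common_2)   # True
```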

GPFS issue

GPFS randomly enters a deadlock state
– A GPFS thread starts waiting for an unknown condition to occur on a remote node
– Waiter threads start to pile up on one of the Network Shared Disk (NSD) servers, waiting for the first one to complete
– The reason for the hung thread is still not known. Possible candidates:
    – Failure of the underlying storage hardware
    – Network issues
    – GPFS bug
    – …
– No clear sign of any of these, though
– A very similar problem was observed at the Tier1
    – They are still investigating too
– Ticket opened with IBM support
    – We were asked to gather some debugging data, but since then the problem has occurred only twice, both times outside working hours, and the system was automatically restarted
No solution found yet, only a workaround to detect the deadlock and restart the services (GPFS and StoRM)
– This eased the consequences of the problem, avoiding further exclusion from DDM
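The workaround script itself is not shown in the slides. A minimal sketch of the detection step is given here, assuming the check is fed the text of a waiter listing (for example from GPFS's `mmdiag --waiters`). The line format matched by the regular expression, the sample text, and both thresholds are assumptions made for illustration.

```python
import re

# Assumption: each waiter line contains its age as "<number> sec".
_AGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*sec")


def waiter_ages(report_text):
    """Extract waiter ages in seconds from a waiter listing."""
    ages = []
    for line in report_text.splitlines():
        match = _AGE_RE.search(line)
        if match:
            ages.append(float(match.group(1)))
    return ages


def looks_deadlocked(ages, max_age_s=600.0, min_waiters=3):
    """Flag a pile-up: many waiters, and the oldest one is very old."""
    if len(ages) < min_waiters:
        return False
    return max(ages) >= max_age_s


if __name__ == "__main__":
    # Made-up listing, only to exercise the parser.
    sample = (
        "waiting 900.5 sec, thread A\n"
        "waiting 850.0 sec, thread B\n"
        "waiting 12.0 sec, thread C\n"
    )
    print(looks_deadlocked(waiter_ages(sample)))
```

A cron job could run such a check periodically and restart GPFS and StoRM when it fires; the restart commands are left out because the slides do not specify them.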