Issues in Milan Two main problems (details in the next slides): – Site excluded from analysis due to corrupted installation of some releases (mainly 16.6.7)


1 Issues in Milan
Two main problems (details in the next slides):
– Site excluded from analysis due to corrupted installation of some releases (mainly 16.6.7)
    This was the real showstopper
    Several time-consuming attempts to clean up and reinstall
    Reinstallation apparently successful, but the release was corrupted again after an hour or so
– StoRM silently stopped processing requests
    The underlying GPFS file system halted in an apparent deadlock, but the storage areas were still correctly mounted -> no alarm was triggered
Unfortunate timing: the two problems occurred at the same time, during the summer holidays (reduced manpower)
– Other, not directly related problems (air conditioning of the computing room, server h/w failures) required attention, further reducing the available manpower
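Since a mount check alone did not reveal the hang, one possible complement is a probe that actually touches the storage area and gives up after a timeout. The following is a minimal sketch, not the monitoring used in Milan: the function name `storage_responds`, the 10 second default and the choice of a directory listing as the probe are all assumptions made for illustration.

```python
import subprocess
import sys

# Hypothetical probe: the name, the timeout and the listing test are
# illustrative assumptions, not part of the original site setup.
PROBE_CODE = "import os, sys; os.listdir(sys.argv[1])"


def storage_responds(path, timeout_s=10.0):
    """Return True if listing `path` in a child process finishes in time."""
    probe = subprocess.Popen(
        [sys.executable, "-c", PROBE_CODE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the listing completed without error.
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child blocked inside the kernel may not exit
        # until the file system recovers, so we do not wait for it here.
        probe.kill()
        return False
```

Running the listing in a child process keeps the monitoring script itself from blocking on a hung file system; an alarm could then be raised whenever `storage_responds` returns False for a storage area that is reported as mounted.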

2 Release installation issue (solved)
In Milan, the WNs are split between two rooms, each belonging to a different subnet, with a single NFS server providing the s/w area to both rooms through two different network adapters
The different NFS network names confused the s/w installation system, causing a race condition between installation jobs on the two sets of WNs
Fully understood and solved (by putting all WNs in a common subnet) only after three weeks
– It was not a particularly difficult problem, but efforts were focused on the other, storage-related issue

3 GPFS issue
GPFS randomly enters a deadlock state
– A GPFS thread starts waiting for an unknown condition to occur on a remote node
– Waiter threads start to pile up on one of the Network Disk Servers (NDS), waiting for the first one to complete
– The reason for the hung thread is still not known. Possible candidates:
    Failure of the underlying storage hardware
    Network issues
    GPFS bug
    …
– No clear sign of any of these, though
– Very similar problem observed at the Tier1
    They are still investigating too
– Ticket opened with IBM support
    We were asked to gather some debugging data, but since then the problem has occurred only twice, outside working hours, and the system was automatically restarted
No solution found yet, only a workaround that detects the deadlock and restarts the services (GPFS and StoRM)
– This eased the consequences of the problem, avoiding further exclusion from DDM
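The detect-and-restart workaround is not described in detail in the slides. As a rough sketch of the idea only, the decision could be based on the symptom reported above: one long-lived waiter with others queued behind it. The function names, the thresholds (300 seconds, 5 waiters) and the `restart_services` callback are hypothetical; how waiter ages are collected from GPFS is left to the caller.

```python
# Hypothetical thresholds, chosen only for illustration.
MAX_WAITER_AGE_S = 300.0
MIN_WAITERS = 5


def deadlock_suspected(waiter_ages_s, max_age_s=MAX_WAITER_AGE_S,
                       min_waiters=MIN_WAITERS):
    """Flag a pile-up: at least one old waiter and several queued behind it."""
    if not waiter_ages_s:
        return False
    oldest = max(waiter_ages_s)
    return oldest >= max_age_s and len(waiter_ages_s) >= min_waiters


def watchdog_step(waiter_ages_s, restart_services):
    """Call restart_services() once if a deadlock is suspected.

    restart_services is a caller-supplied callable (assumed here to
    restart GPFS and StoRM); returns True if a restart was requested.
    """
    if deadlock_suspected(waiter_ages_s):
        restart_services()
        return True
    return False
```

Requiring both an old waiter and a minimum queue length is a design choice meant to avoid restarting on a single slow operation; the right values would have to be tuned against the waiters seen in normal running.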

