Presentation is loading. Please wait.

Presentation is loading. Please wait.

LCG deployment workshop summary Alessandra Forti HEP Sysman meeting Manchester 11 November 2004.

Similar presentations

Presentation on theme: "LCG deployment workshop summary Alessandra Forti HEP Sysman meeting Manchester 11 November 2004."— Presentation transcript:

1 LCG deployment workshop summary Alessandra Forti HEP Sysman meeting Manchester 11 November 2004

2 Outline Status Operation Summary (Ian Bird) Fabric Management Summary (Davide Rebatto) Software Management (Steve Traylen) Operation Security (Dave Kelsey) User Support (Flavia Donno)

3 Status LCG covers many sites (>80) now – both large and small with different needs –Requires very flexible packaging, installation, and configuration tools and procedures There are many problems – but in the end it seems to work –Middleware is relatively stable and reliable –System is used in production –> 80 sites managed the installation –Now have a basis on which to incrementally build essential functionality This infrastructure forms the basis of the initial EGEE production service

4 Status Data challenges: –LCG-2 system had been used for the first time during DCs –Significant efforts invested on all sides – very fruitful collaborations –Adaptations were essential – adapting experiment software to middleware and vice-versa –Many problems were recognised and addressed Level of complexity anticipated for LHC O(100) sites: most of the operations issues are already appearing

5 Operations so far Existing mode is not sustainable –Monitoring run by (very few) people at CERN, RAL, Taipei –Tools not yet sufficient Many problems reported –Most get bounced to CERN team (or left to them), but –Several other people quite active in posting and addressing questions No control over bad sites –Remove from information system … But which one? User support –Not at all clear where a user should report problems

6 LCG and EGEE Operations EGEE is funded to operate and support a research grid infrastructure in Europe The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of LCG service –LCG includes US and Asia-Pacific, EGEE includes other sciences LCG Deployment Manager is the EGEE Operations Manager –CERN team (Operations Management Centre) provides coordination, management, and 2 nd level support Support activities are divided into three levels –Core Infrastructure Centres (CIC) (4) –Regional Operations Centres (ROC) (9) –Resource Centres (sites)

7 RCs ROCs CICs CICs –run services like RBs, Information Indices, VO/VOMS, Catalogues –are the distributed Grid Operation Center (GOC) –24/7 support! ROCs –coordinate activities in their region –give support to regional RCs –coordinate setup/upgrades –24/7 support ? RC –computing and storage –24/7 support ??? Not clear yet the workflow between the different entities

8 Support Teams within LCG & EGEE Deployment Support (DS) Middleware Problems LHC experiments (Alice Atlas CMS LHCb) Other Communities (VOs), e.g. EGEE non-LHC experiments (BaBar, CDF, Compass, D0) Operations Center (CIC / GOC / ROC) Operations Problems Resource Centers (RC) Hardware Problems Experiment Specific User Support (ESUS) VO spec. (Software) Problems Global Grid User Support (GGUS) Single Point of Contact Coordination of UserSupport

9 Operations Working Group Summary Ian Bird CERN IT-GD 4 November 2004

10 CERN IT-GD Issues What is the workflow for operations support? Who participates – in which roles? Escalation procedures, agreement to responsibilities / penalties? How to manage small/bad sites? What is the daily mode of operations and monitoring? Opsman handover procedures etc. Deployment procedures What tools are needed to support this? Who will provide them? How to approach 24x7 global operations support? How this affects workflow; external collaborations What is the interaction/interface to user support? Communication channels? Operations weekly meeting?, RSS, IRC, … Political level agreements on accounting/info gathering granularity Milestones (needed for all working groups) Concrete set of reasonable milestones Fit with service challenges; validate the model; monitoring of milestones Working groups needed for the longer term?

11 CERN IT-GD Model I Strict Hierarchy (modified) CICs locates a problem with a RC or CIC in a region triggered by monitoring/ user alert CIC enters the problem into the problem tracking tool and assigns it to a ROC ROC receives a notification and works on solving the problem region decides locally what the ROC can to do on the RCs. This can include restarting services etc. The main emphasis is that the region decides on the depth of the interaction. ===> different regions, different procedures CICs NEVER contact a site.====> ROCs need to be staffed all the time ROC does it is fully responsible for ALL the sites in the region CIC can contact site directly and notify ROC ROC is responsible for follow-up

12 CERN IT-GD Support workflow Problem 1001 Problem 1002 Problem 1003 Problem 1004 Problem 1005 Problem 1006 Problem 1007 Problem List RC (site) Manage the Problem List GOC Tools GridICE Gppmon Site CERT Gstat ROC RC (site) Problem 1001 Problem 1002 Problem 1003 Problem 1004 Problem 1005 Problem 1006 Problem 1007 Problem List Problem 1001 Problem 1002 Problem 1003 Problem 1004 Problem 1005 Problem 1006 Problem 1007 Problem List FAQ DOC

13 CERN IT-GD Daily ops mode CIC-on-duty (described by Lyon) Responsibility rotates through CICs – one week at a time Manage daily operations – oversee and ensure Problems from all sources are tracked (entered into PTS) Problems are followed up CIC-on-duty hands over responsibility for problems Hand-over in weekly operations meeting Daily ops: Checklist Various problem sources: monitors, maps, direct problem reports Need to develop tools to trigger alarms etc

14 CERN IT-GD Escalation procedures Need service level definitions (Grid 3 site charter) What a site supports (apps, software, MPI, compilers, etc) Levels of support (# admins, hrs/day, on-call, operators…) Response time to problems Agreement (or not) that remote control is possible (conditions) Sites sign-off on responsibilities/charter Publish sites as bad in info system Based on unbiased checklist (written by CICs) Consistently bad sites escalate to political level GDB/PMB Small/bad sites Remote management of services Remote fabric monitoring (GridICE etc)

15 CERN IT-GD Deployment procedures How to formally capture site feedback Priorities for next release, … Web page where info is presented Whats in releases, etc. How to force sites to deploy new releases ROC responsibility Mark site as bad Escalation to GDB, EGEE PMB

16 CERN IT-GD Communications GDA Operations weekly meeting (Grid3 daily mtg service desk+engineers) Could be a model within regions ROCs + sites General news info page RSS customised feeds Various communities General users Run control – messaging/alarm aggregation – sends messages/notifications to ops consoles Use (eg) Jabber as comm tool between CICs (and other operators – ROCs) Mailing lists Rollout Announcements (GOC web page – make people look daily)

17 CERN IT-GD Tools needed GGUS + interfaces to Savannah + local PRMSs (start with Savannah as central aggregator) Monitoring console Monitors (mostly have now) Frameworks – to allow stats and triggers of alarms, notifications, etc. GSI-enabled SUDO (etc) for remote service management Fabric management cook-book Remote fabric monitors

18 CERN IT-GD 24x7 extended support Separate security (urgent) from general support Distributed CIC provides 24x7 by using EU, Taipei, America Real 24x7 coverage only at Tier 0 and 1 Or other specific crucial services that justify cost Loss of capacity – vs damage Classify what are 24x7 problems Direct user support not needed for 24x7 Massive failure should be picked by operations tools

19 CERN IT-GD Interfaces to user support Same structure as for ops support Regional support is needed, but central aggregation (might need language translation) Need inter-ROC agreement on common formats etc. Users free to submit anywhere (local or global) All in same PTS GGUS (ops and user) Documentation and example repository needed in central place Coord done by ROC managers Still need to clarify workflows and make sure people are in place to do the support GGUS becomes the central problem tracker Essential that have rapid evolution as we learn the processes

20 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni Working Group 4: Fabric Management Overall goals: look at site experience in the area of fabric management; focus on fabric management/operations in the context of grid services; understand where common problems or weak areas are; try to converge on some best practices. How can we reduce the frequently heard lament that local problems are one of the main reasons for suboptimal grid reliability? Tentative list of topics: –System installations (LCFGng; Quattor; manual; script-based; what else?). Experience? Which requirements? (e.g.: why do you want/need Quattor? Or manual installation procedures? Or other things? Mix and match?) Which commitments? –Batch/scheduling systems; flavors? Work on testing / development / improvements in this area? Documentation/management tools? How do we define/enforce policies? What policies? (Hard limits? Fairshares?) –How do we monitor the fabric and what do we do with monitoring info; explore what sites are using / have developed; or what they would like to have, and miss. How do we certify/select fabric components? How do we provide status information to fabric users? –Fabric set-ups for the grid (shared file systems? MPI support? Parallel FS? Hyperthreading? Bad/good experiences to share? Use cases? How do we manage storage? Networking?) –Upgrade procedures, how are they triggered, and how do we cope with them? For example: security/performance patches to clusters; upgrades (or downgrades) to batch systems –How do/can we provide fabric-related feedback to LCG/EGEE? –How do we disseminate information about fabric/grid issues? (Fabric training for sites? Training on how batch systems are used – for app developers? Do we [want to] publish policies?) –Conflicts with other (non-grid / non-LCG) uses of the fabric? –Can we identify points of contact for (some of) these topics? –How do we keep on discussing these things? Hepix? Some sort of LCG/EGEE-specific forum? –? Format –For each topic: informal presentations from site operations; discussion; actions, timeline

21 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni System Installations (1) Unanimous desire to have a community-supported way of dealing with system installations –Rather than reinventing/discovering solutions independently Very strong interest in using Quattor –Most sites currently use LCFG, with a couple of sites mix and match systems (LCFG + scripts – not the ones presented on Tuesday by Laurence/Louis) –4 sites (including CERN) have already installed Quattor servers Key point: provide at the very least installation, configuration of Worker Nodes Installation reports being written (NIKHEF, RAL, CNAF) –Pros Modularity, possibility of switching on/off individual components Load balancing, scalability capabilities –Cons Perfectible installation guide, perceived complexity in setting the system up Several components have already been written at CERN, but there is no clear publicity, nor testing outside of CERN (example: afs component) LCG components are also available (Cal Loomis), but, again, with limited documentation

22 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni System Installations (2) Positioning: generic installation procedures (formerly: manual installations) are not competing with Quattor –Rather, we expect to build on top of the generic installation scripts Resources –Within November, well have a formal Quattor WG under the auspices of the GDB –SA1 people can officially work on Quattor (i.e., account for time spent on Quattor in the EGEE timesheets). There is no free lunch here either; we expect the formation of a virtuous circle with this WG to help with configuration and support of Quattor components. Actions –We need not wait for the formation of the Quattor WG to start doing something –Provide details on Quattor scalability (comparable to an analogous work made in EDG times for LCFG) and resilience, with an How-To – German Cancio –Better document which components are available at CERN, and encourage their usage outside of CERN – German Cancio –Organize a Quattor training for interested parties – CERN (around December)

23 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni Fabric Monitoring Very diverse picture, all the usual suspects are present: Nagios, Ganglia, GridICE, LEMON, a variety of self-made scripts –A complaint: there are too many monitoring systems. A suggestion: this needs to be discussed in detail within SA1. But most of the sites also say that they plan to invest time and resources in finding more comprehensive monitoring system (so, potentially, well have even more monitoring systems) –Most of these systems focus on hardware-related metrics (CPU Temperature, fan rotation speed, etc) –Very worrying: only 2 sites actually have systems that proactively try to deal with black holes On the other hand, several sites have experienced black holes, and rely on just manual intervention following automatic notification mechanisms (best case), or user notifications (worst case). This can lead to varying degrees of unreliability. Actions –Create a subgroup to document which tests should be made at the fabric level to verify that WNs, batch system are in good order Try out Piotrs script locally Validate and integrate existing scripts Weizmann, RAL, CNAF, IN2P3, NIKHEF, CERN (German Cancio)

24 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni Some Farm Set-up Issues (1) Hyper-Threading –Most sites have it disabled –One site (CNAF) enabled it after an explicit request from an experiment. But this does not really scale or work if: You have multi-purpose center, or fairshare allocations (rather than statically defined sub-clusters) You do not have properly configured nodes (for example, enough memory to avoid resource contention, enough jobs running on nodes to avoid suboptimal cpu usage, suitable OS) You have conflicting requests from experiments –A difficulty is that experiments do not seem interested in benchmarking their applications (is this feasible at all) –Disk servers supposedly benefit from HT –Actions Gather existing HT data, and perform some tests – CNAF, IN2P3, RAL

25 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni Some Farm Set-up Issues (2) What do you specify as technical requirements in procurement tenders? –Thermal properties (horror stories heard) Serial-ATA vs. SCSI disks: is the price difference justified? This is an example of a topic being discussed in several places (CHEP, Hepix), and we need to channel the discussions into this forum/WG Operational fabric security: a site would like to have an up-to-date list of IP networks to which allow outbound connectivity (to reduce the impact of a DoS attack). A topic for WG1? (operational security) Survey of common use cases for farm set-ups: review document being written in SE Asia (to be out approx. in 1 week) – All

26 Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni Communication Channels We would like to have a single point of contact for grid- sysadmin-related topics. But the format for this is not agreed yet –A web page/portal? Who is going to maintain it? Yet another suggestion to SA1? –LCG-ROLLOUT is too generic Do we need/want another mailing list? There are topics we would have liked to discuss, and did not have the time –Should getting together in a similar WS be repeated [more frequently]? Should we wait 6 months for the Hepix meeting in Karlsruhe? We should report about what we discuss and do in the context of this WGs activities at Hepix –An item for discussion in the GDB? How is WG2 (operational support) going to address this?

Download ppt "LCG deployment workshop summary Alessandra Forti HEP Sysman meeting Manchester 11 November 2004."

Similar presentations

Ads by Google