EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE COD21 22.09.09 EGEE09 Barcelona C-COD Survey results Vera Hansper.


1 C-COD Survey results
Vera Hansper
COD21, 22.09.09, EGEE09 Barcelona
Enabling Grids for E-sciencE (EGEE-III INFSO-RI-222667), www.eu-egee.org

2 About the survey
5 simple questions to assess how the community sees the C-COD role
6 ROCs responded
  – 2 ROCs had responses from 2 different ROD teams
  – 1 ROC had responses from 5 ROD teams
  – 12 separate responses in total
All replies were welcome
  – Even if not all questions were answered (this skews the results slightly)
  – Even if some responses were out of context
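The total of 12 responses can be checked with a short tally. This is only a sketch: the slide does not say how many responses the remaining three ROCs sent, so one response each is an inference made here to reach the stated total.

```python
# Responses per ROC. The first three entries come from the slide
# (two ROCs with 2 ROD teams, one ROC with 5 ROD teams); the three
# single-response ROCs are an assumption, not stated in the source.
responses_per_roc = [2, 2, 5, 1, 1, 1]

roc_count = len(responses_per_roc)
total_responses = sum(responses_per_roc)
print(roc_count, total_responses)
```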

3 1. How do you see the role of C-COD?
Interpretation of what the C-COD role actually does.
  – 8 responses: an oversight role for ROD teams (a co-ordination role overseeing quality of operations)
  – 1 response: interpreted as the dashboard
  – 1 response: stressed lightweight framework of role
  – 2 responses: no comments
One response was overall happy with C-COD and had no further suggestions

4 Good summary of the role (courtesy of UKI)
Oversight and quality control of the RODs
Help in ticket handling for non-ROC matters
Provision of ROC tools
Integration of resource status into ROD tools (dashboard)
Coordination of RODs

5 2. Do you find it useful?
Affirmative: 11
No response: 1
Negative: 0
Further comments:
  – Could be improved
  – Allows one to discover anomalies and sites that are not working properly, thereby reducing problems on the production grids.
  – Some matters are beyond the ROC's control
  – For quality control, many operators found the COD intervention too invasive: the ROD operators know the sites better than the C-COD (however, they do understand that this was to help with the transition to the ROD model).

6 3. Do you find it necessary?
Affirmative: 11
No response: 1
Negative: 0
Further comments:
  – Definitely: a production grid needs this kind of support.

7 4. Do you think there is something missing from the daily operations handling of tickets?
There are some related steps mentioned in different sections, for example:
  – In 6.4.2.1 Sites in downtime:
  – When a ticket is open against a site that continues to add downtime the tickets must be closed...
This case could also go under 6.3.3 Closing tickets... So maybe it's better to have a workflow/flowchart to explain this kind of procedure/steps.
There is no automatic closing of useless alarms (this is not easy to implement, but it is necessary).
The two tickets that are generated when raising a ticket against a site (dashboard and GGUS) seem to be a bit much. One ticket would reduce management overheads.

8 4. responses, cont.
I think that some automations are still missing; C-COD (sic) shifters are currently occupied by a number of "trivial" operations that could probably be executed automatically by the dashboard software. Or maybe we only need a more ergonomic and integrated visualization (some information is duplicated across several tabs, some is hard to find).
I think that the CIC dashboard support greatly helps our work.
I think it's ok.

9 4. responses, cont.
Maybe a better handling of sites not yet in production in the GOCDB: if they are monitored and in the site-BDII they will still show up, and in practice our only choices are to put them in downtime or to switch monitoring off (the latter is not what the sites want, as they also want to know that the SAM tests work and don't want to reconfigure their site-BDII all the time).
We don't see the not-in-production info in the dashboard, and we have to close tickets where the error is practically still "on", which is bad for our metrics.
At least I would like to have a best practice for that. (Do we know how downtimes are counted for nodes not yet in production for the sites?)

10 4. responses, cont.
There is consensus that we need a method in the dashboard to resolve all current alarms in OK status (like the old "global" button).
  – This will allow one to concentrate time/effort on the really interesting ones.
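The requested "resolve all alarms in OK status" action could be sketched roughly as follows. This is a minimal illustration, not the CIC dashboard's actual code or API: the `Alarm` record, its fields, and the `resolve_ok_alarms` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical alarm record; all field names are assumptions."""
    site: str
    test: str
    status: str            # latest test result, e.g. "OK" or "ERROR"
    resolved: bool = False


def resolve_ok_alarms(alarms):
    """Mark every unresolved alarm whose latest status is OK as resolved.

    Returns the unresolved alarms that still need an operator's attention.
    """
    remaining = []
    for alarm in alarms:
        if alarm.resolved:
            continue
        if alarm.status == "OK":
            alarm.resolved = True
        else:
            remaining.append(alarm)
    return remaining


alarms = [
    Alarm("site-a", "host-cert", "OK"),
    Alarm("site-b", "job-submit", "ERROR"),
    Alarm("site-c", "host-cert", "OK"),
]
todo = resolve_ok_alarms(alarms)
print([a.site for a in todo])  # sites whose alarms are still failing
```

One bulk call leaves the operator with only the alarms that are still failing, which is the effect the respondents asked for.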

11 5. Do you feel something else could or should be done regarding daily operations, and if so, what?
SLA can be considered with daily operations.
There are a lot of clicks to be done if following the procedures properly.
Many of the alarms actually result from one-off transient failures which, although rare for a single site, happen quite frequently across a number of sites. This generates some "noise in the system", especially since a site admin has no way of solving such a problem, and this results only in frustration.
Apart from the tests for ALICE/LHCb etc. that are seen on the COD dashboard, it would be wise to also have the results for the other VOs that the site supports, so that site administrators or country-level operators can investigate what is wrong.

12 5. responses, cont.
Again I hope that more integration with the ROD ticketing system will improve the efficiency of the shifts.
  – A stronger integration among tools could further improve our activity.
It is wonderful not to have to do it every week or every second week. Our Nordic solution of keeping shifts to a minimum is best for keeping us motivated, because otherwise it would get very boring and it would be more difficult to hold a certain standard.
  – Would like to have a clear split in the dashboard metrics between the Nordic and the Benelux parts of the NE ROC.
It should be possible to remove a person, or at least all of their roles, in the GOCDB.

13 5. responses, cont.
The handling of alarms needs to be improved so that:
  – There are fewer false alarms (e.g. the host cert check often fails, and then is later OK). In some cases, no other alarms were raised, so this may indicate that there is a problem with the alarm.
  – Alarms should be self-healing: operators spend a lot of time switching off transient alarms.
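One way to picture a "self-healing" alarm is a small state machine that raises only after several consecutive failures and clears itself on the next successful test. The sketch below is an illustration under assumptions, not a description of how SAM or the dashboard behaves: the class name `SelfHealingAlarm` and the `fail_threshold` parameter are hypothetical.

```python
class SelfHealingAlarm:
    """Hypothetical alarm state for one test on one site."""

    def __init__(self, fail_threshold=2):
        # Consecutive failures required before the alarm is raised
        # (the default of 2 is an arbitrary assumption).
        self.fail_threshold = fail_threshold
        self.consecutive_failures = 0
        self.active = False

    def record(self, passed):
        """Feed one test result; return True if the alarm is active afterwards."""
        if passed:
            # A passing test clears the alarm without operator action.
            self.consecutive_failures = 0
            self.active = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.fail_threshold:
                self.active = True
        return self.active


alarm = SelfHealingAlarm(fail_threshold=2)
results = [True, False, True, False, False, True]
history = [alarm.record(r) for r in results]
print(history)  # [False, False, False, False, True, False]
```

In this run the single isolated failure never raises an alarm, and the alarm raised by two consecutive failures clears itself once the test passes again.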

14 Summary
It appears that C-COD, as an oversight body, is perceived as needed
  – though there are things that could possibly be changed
  – Further study of these responses is needed to find improvements
  – More feedback could also be useful
More ideas and feedback have been obtained regarding operations in general.
  – These also need better analysis

