1 EGEE-III INFSO-RI-222667, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks
The day-to-day work of a TPM
Gkamas Vasileios (vgkamas@cti.gr)
Research Academic Computer Technology Institute, Rio, Patras, Greece
TPM Training, 10th-11th November 2008, SARA, Amsterdam
http://indico.cern.ch/conferenceDisplay.py?confId=41895

2 Outline
This presentation aims to provide you with:
  – Knowledge of the GGUS support levels and their support units
  – Knowledge of the GGUS workflow
  – TPM purposes and actions
      • Registration
      • GGUS access
      • How to handle tickets
  – TPM guidelines and best practices

3 Support Units in GGUS
[Diagram with the labels User World, Supporter World and Central GGUS Application, showing three support levels.]
1st level: TPM (solves, classifies, monitors)
2nd and 3rd level: VO Support Units, Middleware Support Units, Network Support Unit, Operations Support, ROC Support Units, Deployment Support Unit, Preproduction Support Unit, ETICS Support Unit

4 ... And types of problems
Accounting; Authorization/Authentication; Catalogue; COD operations; Configuration; Data Management – generic; Deployment - other; Documentation; File Access; File transfer; Information system; Installation; Local Batch System; Middleware; Monitoring; Operations; Network problem; Other; Security; Storage systems; VO specific software; Workload Management; 3D/Databases
The user picks from this list to indicate the type of problem when submitting a ticket.

5 The TPM
Two types of TPM: the Generic TPM and the VO TPM
Generic TPM:
  – Generic Grid middleware experts, mainly from the ROCs.
  – Experienced in Grid infrastructure installation, configuration and usage.
  – Provide the “First Line Support”.
  – Answer tickets whenever possible.
  – Assign tickets to middleware support units or ROCs.
  – Watch for tickets that have stayed unchanged for a long time and bring them to the attention of the respective support units or ROCs.
VO TPM:
  – Similar duties to the Generic TPM.
  – Also experienced in generic VO software problems.
  – Assign tickets to VO support units.

6 Purpose of a TPM
Close simple trouble tickets
Notify users about the status of their tickets
Ensure the correct routing of tickets for processing
React to alarms that tickets have not been processed
Ensure that the knowledge base and Wiki pages are enhanced by the responses associated with tickets
  – http://goc.grid.sinica.edu.tw/gocwiki/FrontPage
Rotate the TPM operation among the participating ROCs in a timely, co-operative and informative way

7 Register as a TPM
GGUS web portal: https://gus.fzk.de/
Access to GGUS via certificate is preferable
Information regarding registration is available on the registration page
  – The link to this page is located in the top navigation bar of GGUS
To register as support staff, click the “Apply” link and fill in the registration form.
  – After registering successfully you will receive a confirmation mail from GGUS.

8 Procedures for “New tickets” – I
The TPM is notified by mail of every new ticket
The TPM should try to evaluate whether:
  – He can give a solution, either directly or after interaction with the user.
      • This requires generic knowledge of documentation, middleware and grid services operations
  – It's a site service problem
      • Example: a user complains that a file transfer or job submission to a given site is failing.
  – It's a software problem or bug, or some “obscure” option in the component X API.
After assigning a ticket, the TPM should send a mail to the user saying that the ticket has been acknowledged and sent to the proper support unit

9 Procedures for “New tickets” – II
The TPM is notified by mail of every change in every ticket.
  – The “new” tickets are in general easier to treat
The first “impulse” of the TPM is to assign a new ticket right away to a ROC when a site service error is obvious
  – But sometimes errors are not as transparent as they seem and may mask deeper problems
  – You should be critical and try to evaluate the problem as well as you can
  – This avoids incorrect routing of tickets
      • … and time lost on their resolution
So, let's see what comes next...

10 Procedures for “New tickets” – III
So... what do you do if the user writes:
  – “OOPS... my dCache has been behaving strangely since two days ago, before that it was working perfectly”
  – “Didn't upgrade or touch anything!!”
TPM reaction:
  – Ask the user for more information:
      • Ask for the dCache logs
  – Look at the logs to see whether you can figure out what is wrong.
      • If you cannot find what is wrong, assign the ticket to the dCache SU

11 Procedures for “New tickets” – IV
So... what do you do if the user writes:
  – “Of my 100 FTS transfers, about 40 failed to TIER1 X but the others were OK.”
TPM reaction:
  – Ask the user to provide information about some of the failed FTS transfers
  – Check from the FTS transfers whether the problem is due to a site problem or misconfiguration
      • If yes, assign the ticket to the responsible ROC
  – If possible, check whether there are open “ENOC” tickets that involve the site where the transfers failed.
  – At a certain point it may be necessary either to involve the FTS developers or to assign the ticket to them.

12 Procedures for “New tickets” – V
Note that the previous cases were just hypothetical; we will get to the “real world” tomorrow.
Sometimes this type of problem is assigned by trial and error.
  – It is important that the parties involved don't get “furious” if they get a ticket which is not in their “power” to solve.
When a ticket is assigned to the wrong support unit, either:
  – The ticket is re-assigned back to the TPM, or
  – The support unit re-assigns the ticket directly to the appropriate support unit.

13 Related tickets
This is a generic procedure for any supporter: TPM, ROCs and MW SUs.
If a supporter identifies a ticket related to an older open ticket:
  – Insert the other ticket's reference in the “related issue” field.
  – For one of the tickets, “send a mail to the submitter” saying that ticket #NNNN reports the same problem.
  – Set the status of that ticket to “unsolved”, leaving the other one in the “In progress” state
  – The “unsolved” ticket should be set to solved only after the related one is closed.

14 GGUS ticket search engine – I
Several options are available in the "search engine"
Connect to the GGUS interface
Select “Support Staff”
Select “Enter the GGUS ticket search engine”

15 GGUS ticket search engine – II
The TPM will most commonly select: [screenshot of the search form]

16 GGUS ticket search engine – III
The Middleware SUs and ROCs will most commonly use the “Support Unit” field
Examples: ROC_SE tickets with status “open”; gLite Workload SU tickets with status “solved”

17 GGUS ticket search engine – IV
I want to find the tickets from a user, say “Vasilis Gkamas”: [screenshot of the search form]

18 GGUS ticket search engine – V
And... a supporter will want to see the “unsolved” tickets he/she is involved in: [screenshot of the search form]

19 How to recognize if it's a service or software ticket
A user complains that a given command or action fails at a given site:
  – Check that there is no error in the syntax of the command submitted by the user.
  – Try the same commands yourself
      • If they fail at that site, try issuing the same commands against another site.
      • If they succeed at another site, this can indicate a site-local problem: assign the ticket to the ROC.
      • If the command against another site gives the same error, it can also indicate problems in core services, like the VOMS server, LFC, RB...
      • Try using another core service if possible…
  – At the end of the day, the problem can be some bug in the MW component.
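The checks on this slide can be read as a small decision procedure. The sketch below is only an illustration of that reading: the function name `triage`, the `Routing` values and the "not reproduced" outcome are assumptions added here, not part of GGUS or of the original slides.

```python
from enum import Enum


class Routing(Enum):
    # Hypothetical labels for the outcomes discussed on the slide.
    ASK_USER = "ask the user to correct the command"
    NOT_REPRODUCED = "could not reproduce, ask the user for details"
    ROC = "assign to the ROC of the failing site"
    CORE_SERVICE = "suspect a core service (VOMS, LFC, RB...)"
    MIDDLEWARE = "suspect a bug in the middleware component"


def triage(syntax_ok, fails_on_reported_site, fails_on_other_site,
           fails_with_other_core_service=None):
    """Suggest a routing from the results of the manual checks.

    fails_with_other_core_service stays None when no alternative
    core service was available to test.
    """
    if not syntax_ok:
        return Routing.ASK_USER
    if not fails_on_reported_site:
        # Assumption: the slide does not cover this case.
        return Routing.NOT_REPRODUCED
    if not fails_on_other_site:
        # Works elsewhere, so the problem looks local to the site.
        return Routing.ROC
    if not fails_with_other_core_service:
        # Same error on two sites; an alternative core service either
        # works (False) or could not be tried (None).
        return Routing.CORE_SERVICE
    # Fails everywhere, even with another core service.
    return Routing.MIDDLEWARE
```

The result is a hint for the supporter, not an automatic assignment; as the earlier slides stress, errors may mask deeper problems.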

20 Mail interface
The TPM and the involved support units or ROCs are notified by mail when a ticket is updated.
  – The mail “subject” starts with: GGUS-Ticket-ID: #NNNNN.......
  – The sender is: helpdesk@ggus.org
You can simply reply to that mail and your answer will appear in the history of the ticket
  – The user is notified.
  – Do not change the subject of the mail, since it is used to match the reply to the ticket and enter it into the system.
You should not use the mail interface for more complex tasks
  – It is preferable to use the GGUS application
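The subject convention is the reason the subject must not be edited. As a minimal sketch of how a mail filter on the supporter's side might pick the ticket number out of such a subject: the pattern below is inferred from the "GGUS-Ticket-ID: #NNNNN" prefix quoted on the slide, and tolerating a leading "Re:" is an assumption; this is not the parser GGUS itself uses.

```python
import re

# Assumed pattern, based on the subject prefix quoted on the slide.
TICKET_RE = re.compile(r"GGUS-Ticket-ID:\s*#(\d+)")


def ticket_id_from_subject(subject):
    """Return the ticket number found in a mail subject, or None."""
    match = TICKET_RE.search(subject)
    return int(match.group(1)) if match else None


# Hypothetical subjects for illustration.
print(ticket_id_from_subject("Re: GGUS-Ticket-ID: #8253 transfer fails"))
print(ticket_id_from_subject("Lunch on Friday?"))
```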

21 Communication with the user
Communication with the submitter of the ticket
  – Use the “Public diary” link on top of the “Modify section Ticket-ID: #NNNNN”.
Internal diary comments
  – Are always hidden from the ticket submitter
The user receives all updates to his/her tickets if…
  – he/she chose “Notification mode: On every change” when submitting the ticket.
The user is notified by mail about the “solution” of the ticket only when the ticket is set to “solved”.

22 About “old tickets” – I
Common mistakes:
  – SUs/ROCs try to communicate with the user through the “Internal diary” or the “Final solution” field.
      • The user does not “see” the information in those fields
      • Only when the ticket is set to “solved” can the user see what is in the “solution” field
  – “Waiting for reply”: although the user has answered the supporter's questions, the ticket stays in that state for a long period.
      • Ask the SU to take action
  – Wrong assignment: a ticket is sometimes assigned to the wrong ROC or SU.
      • This is especially true for “difficult” tickets, when the TPM is not quite sure what the real source of the problem is.

23 About “old tickets” – II
ROC tickets: what the TPM can or should do
  – Try to take a “holistic” view per ROC.
      • See if there are rather old tickets with a long history of escalations.
  – Write in the internal diary asking the ROC to “take an action” or notify the sites.
      • The ROCs have rather strict and documented “operational procedures” from the COD.
  – It has been observed that a notification to the ROC through the ticket produces results.
      • In general the ROC reacts and notifies the sites so that they act on the tickets.

24 About “old tickets” – III
ROC tickets: what the TPM can or should do
  – The sites are not asked, either by the TPM or by the ROC, to close the ticket.
  – They are asked to provide information about the status and progress of the problem.
  – Notification about the status and progress of the ticket must be sent to the submitter of the ticket.
  – The ROC support groups should treat user tickets the same way as COD tickets.

25 About “old tickets” – IV
MW SU tickets: what the TPM can or should do
  – Try to take a “holistic” view for each MW SU.
  – If there are many open tickets:
      • Check for some of them whether they are being worked on.
      • If there is a long history of escalations, send a mail directly to the mailing list of the SU, asking “politely” about the ticket's progress
  – If the number of tickets is small, you can check what is happening and write a comment in the internal diary asking for an update on the progress of the problem.
  – Check whether the supporter tried to communicate with the user in the wrong way, through the “solution” or “internal diary” fields. If so, send a mail to the submitter.
  – In all cases, pay special attention to “Waiting for reply” tickets.

26 About “old tickets” – V
MW SU tickets: what the TPM can or should do
  – For tickets related to bugs in or enhancements of the MW components, either:
      • Open a ticket in savannah (https://savannah.cern.ch), or
      • Ask the support unit to do it.
  – Bugs in the MW are discussed in the Engineering Management Team (EMT).
  – Enhancements and new features of the MW components are discussed in the Technical Coordination Group (TCG).

27 About “old tickets” – VI
MW SU tickets: what the TPM can or should do
  – A GGUS ticket with a reference to a savannah bug should be put into the “unsolved” state.
  – The ticket should be set to “solved” only after the bug has been fixed and the fix has been introduced in the production release.
      • This action is up to the support unit.
  – The TPM should not forget to put the email address of the user in the “cc” field of the savannah bug
      • Send a mail to the user reporting this situation.

28 About “old tickets” – VII
MW SU tickets: what the TPM can or should do
  – The TPM has no way to “enforce” that SUs act on their tickets.
  – The TPM evaluates the problem as well as possible and makes an effort to solve it without involving the support units.
  – Sometimes the new TPM on shift can give input on old problems, or can even re-assign the ticket back to “TPM”.
Note that: we know that you are not going to “read” all the tickets to see if you can do something.

29 Consequences and...
The user has submitted a ticket and the ticket is not acted upon for months, with no communication with the user.
  – The user has forgotten the ticket…
  – The user solves the problem by other means:
      • He talked to other people or to the supporters themselves and asks to close the ticket, and you don't have a clue about the solution!
  – The user, after submitting some tickets which are not acted upon, gives up using the system.
  – The problem simply and “auto-magically” disappeared, and you don't have a clue what really happened

30 ... advice
Please don't send a mail to the user saying:
  – “Is the problem solved? Can we close the ticket?”
Because the user can reply:
  – “No, please don't close it yet: ” (ticket #8253)
  – “This ticket is not solved. It will be solved when...” (ticket #8818)
  – We don't want to hear people saying: “GGUS is eager to close the tickets”.
You should ask: “What is the status or progress of this problem?”
You may even discover that you can give some feedback on the problem.
An old ticket without any action from either the user or the support unit IS NOT equivalent to “is solved”.

31 Replicate & Master/Slave tickets
You can duplicate tickets and assign them to multiple Support Units
  – Use the “Duplicate ticket” option.
  – Add a description in the internal diary asking each SU to take care only of the part of the ticket that concerns it.
  – Ask each SU to notify the TPM when it is done.
You can define master/slave tickets.
  – Normally, if you duplicate tickets to different SUs, the original ticket can be marked as “Master” and the replica tickets can be tagged as “slaves”
      • This is not mandatory!
  – Each slave has to reference its Master ticket
  – Use the master/slave relation fields in the GGUS ticket interface
  – The parent (master) ticket cannot be solved before the child (slave) tickets are solved
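The last rule on this slide, that a master cannot be solved while any of its slaves is still open, can be modelled in a few lines. This is a toy model for illustration only; the `Ticket` class, its fields and the ticket numbers are hypothetical and do not reflect how GGUS stores tickets.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    # Hypothetical, simplified ticket record.
    ticket_id: int
    support_unit: str
    status: str = "assigned"
    slaves: list = field(default_factory=list)

    def duplicate_to(self, new_id, support_unit):
        """Create a slave copy for another support unit, linked to this master."""
        slave = Ticket(new_id, support_unit)
        self.slaves.append(slave)
        return slave

    def can_be_solved(self):
        """A master may be solved only once every slave is solved."""
        return all(s.status == "solved" for s in self.slaves)

    def solve(self):
        if not self.can_be_solved():
            raise ValueError("slave tickets are still open")
        self.status = "solved"


master = Ticket(1000, "TPM")
roc_copy = master.duplicate_to(1001, "ROC_SE")
mw_copy = master.duplicate_to(1002, "gLite Workload")
roc_copy.solve()
print(master.can_be_solved())  # one slave is still open
mw_copy.solve()
master.solve()
print(master.status)
```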

32 For the MW and Tools SUs – I
The MW and Tools SUs get tickets assigned from the TPM
The SU staff must change the status of the ticket to “In progress”
  – This means the SU has acknowledged the ticket and is starting to work on it
The SU can assign the ticket to a given supporter
  – Using the “Assign ticket to specific person(s):” field.
If the SU needs more information from the user, the “Public diary” should be used; or reply to the email…
When the SU asks the user for information, the status should be set to “waiting for reply”.
When the user replies, the SU SHOULD change the status back to “In progress”. This change has to be done manually.

33 For the MW and Tools SUs – II
If the SU identifies the ticket as a bug in the MW component:
  – The SU (not the user) should open a bug in savannah, referencing the GGUS ticket and putting the user's email address in the “cc” field.
  – In the GGUS portal, the supporter should insert the savannah reference in the “Related issue” field
  – Send a mail to the user explaining the situation
  – Change the status of the ticket to “unsolved”.
  – When the bug is fixed in savannah and the patched component has been deployed in the production release, the supporter should set the GGUS ticket to “solved”.
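The status changes described for the MW and Tools SUs can be summarised as a small transition table. The sketch below is an assumption-laden reading of the slides: the table `ALLOWED` lists only the changes mentioned in this talk (plus a direct “in progress” to “solved” step, which is assumed), and the real GGUS workflow may permit others.

```python
# Status changes read off the slides; "in progress" -> "solved" is an
# assumption, and the real GGUS workflow may allow more transitions.
ALLOWED = {
    "assigned": {"in progress"},
    "in progress": {"waiting for reply", "unsolved", "solved"},
    "waiting for reply": {"in progress"},
    "unsolved": {"solved"},
    "solved": set(),
}


def change_status(current, new):
    """Return the new status, or raise if the change is not in the table."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"unexpected status change: {current!r} -> {new!r}")
    return new


# A ticket that turns out to be a middleware bug tracked in savannah.
status = "assigned"
for step in ["in progress", "waiting for reply", "in progress",
             "unsolved", "solved"]:
    status = change_status(status, step)
print(status)
```

Writing the table out makes the manual step visible: nothing moves a ticket from “waiting for reply” back to “in progress” unless a supporter does it.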

34 COD and ENOC tickets – I
If you are a TPM, you should be aware of tickets from:
  – CIC on duty (COD): tickets opened to ROCs about sites failing the periodic SFT/SAM and Gstat tests.
      • They start with: ----------Affected site:...
      • They end with: The EGEE CIC on Duty Team
  – ENOC: tickets reporting network problems.
      • Like: GEANT2 has a network incident located in...
  – New user tickets reporting site service problems
      • The TPM should check whether there are already open tickets from the COD for that site.
      • There is the possibility to link related tickets.
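The header and signature quoted on this slide are enough for a rough check of whether a ticket description comes from the COD. The sketch below is a heuristic built only on those two strings; the function name and the sample text are hypothetical, and real COD tickets may vary in ways the slide does not show.

```python
# Strings quoted on the slide.
COD_HEADER = "Affected site:"
COD_SIGNATURE = "The EGEE CIC on Duty Team"


def looks_like_cod_ticket(description):
    """Heuristic: leading dashes plus header at the start, signature at the end."""
    text = description.strip()
    starts_ok = text.lstrip("-").lstrip().startswith(COD_HEADER)
    ends_ok = text.endswith(COD_SIGNATURE)
    return starts_ok and ends_ok


# Hypothetical ticket body for illustration.
sample = (
    "----------Affected site: EXAMPLE-SITE\n"
    "A periodic test is failing.\n"
    "The EGEE CIC on Duty Team\n"
)
print(looks_like_cod_ticket(sample))
print(looks_like_cod_ticket("My job submission fails at EXAMPLE-SITE"))
```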

35 COD and ENOC tickets – II
If you are ROC support or a site admin, you will surely want to watch the tickets in your region and/or site.
If you are a Middleware SU, you don't have to worry about any of this, unless one of those tickets ends up being assigned to you.

36 Tools for TPMs
Explore the “Useful Links for Admin”, among them:
SAM: https://lcg-sam.cern.ch:8443/sam/sam.py
GOC DB: https://goc.gridops.org/
  – You get the status of the sites and also the site names and the ROC to which they belong.
  – Very useful if you “Sort by Region”!
  – “Is this site in downtime?”
OSG production sites: http://is.grid.iu.edu/cgi-bin/status.cgi
  – If a site is not found in the GOC DB, search for it in the OSG production sites list
Explore the “General support/TPM information” at GGUS: https://gus.fzk.de/pages/info_for_supporters.php

37 FAQs – I
Where is the schedule for TPMs?
  – The schedule can be found at:
      • https://gus.fzk.de/pages/tpm.php
How does the handover from one TPM team to the next happen?
  – On Mondays at 12:00 UTC
  – The forwarding mail address for tickets assigned to TPM changes automatically to the next name in the schedule.
  – The TPM team simply generates a ticket in GGUS with a title such as: “Handover of TPM from T3 SE team”
How can I determine which ROC a site belongs to?
  – Check the GOC DB or the OSG production sites
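Since the handover is fixed at Mondays 12:00 UTC, a team can compute when its shift ends. The helper below is a small sketch written for this transcript; `next_handover` is a hypothetical name and the function expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_handover(now):
    """Return the next Monday 12:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=12, minute=0, second=0, microsecond=0)
    # weekday() is 0 for Monday; step back to this week's Monday noon.
    candidate -= timedelta(days=now.weekday())
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Wednesday of the week after the training; the next handover is the Monday after.
print(next_handover(datetime(2008, 11, 12, 15, 30, tzinfo=timezone.utc)))
```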

38 FAQs – II
How can I send a query to more than one ROC?
  – If you receive a ticket which concerns multiple sites, duplicate the ticket n times and assign each copy to the appropriate ROC.
How shall I handle spam tickets?
  – The “Change ticket category” field has a special value called “Spam”.
  – Change the ticket category to “Spam”.
  – The status of the ticket automatically changes to “Solved” and the word “Spam” is inserted in the “Solution” field.
  – Spam tickets are automatically deleted from the system after one week.
Is there a wiki available to TPMs?
  – Yes:
      • http://goc.grid.sinica.edu.tw/gocwiki/TPM
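For a ticket that concerns several sites, it helps to group the sites by ROC before duplicating. The sketch below assumes one copy per ROC, which is one possible reading of “n times”, and assumes the site-to-ROC mapping has been looked up by hand in the GOC DB; the site names and all ROC names other than ROC_SE are made up for illustration.

```python
from collections import defaultdict


def sites_by_roc(sites, site_to_roc):
    """Group affected sites by ROC; each key would get one ticket copy."""
    grouped = defaultdict(list)
    for site in sites:
        # A KeyError here means the site is missing from the lookup,
        # e.g. it may be an OSG site rather than a GOC DB entry.
        grouped[site_to_roc[site]].append(site)
    return dict(grouped)


# Hypothetical mapping, as one might copy it from the GOC DB.
lookup = {"SITE-A": "ROC_SE", "SITE-B": "ROC_SE", "SITE-C": "ROC_EXAMPLE"}
print(sites_by_roc(["SITE-A", "SITE-C", "SITE-B"], lookup))
```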

39 Contacts and help for you
Questions about procedures:
  – support@ggus.org
Technical support (every TPM is here):
  – tpm-grid-support@cern.ch
Help line:
  – +49 724 782 8383 (FZK-GGUS team)

